Abstract

We propose a general framework for studying adaptive regret bounds in the online learning framework, 
including model selection bounds and data-dependent bounds. Given a data- or model-dependent bound 
we ask, “Does there exist some algorithm achieving this bound?” We show that modifications to recently 
introduced sequential complexity measures can be used to answer this question by providing sufficient 
conditions under which adaptive rates can be achieved. In particular each adaptive rate induces a set of 
so-called offset complexity measures, and obtaining small upper bounds on these quantities is sufficient to 
demonstrate achievability. A cornerstone of our analysis technique is the use of one-sided tail inequalities 
to bound suprema of offset random processes. 

Our framework recovers and improves a wide variety of adaptive bounds including quantile bounds, 
second-order data-dependent bounds, and small loss bounds. In addition we derive a new type of adaptive 
bound for online linear optimization based on the spectral norm, as well as a new online PAC-Bayes 
theorem that holds for countably infinite sets. 


1 Introduction 

Some of the recent progress on the theoretical foundations of online learning has been motivated by the 
parallel developments in the realm of statistical learning. In particular, this motivation has led to martingale 
extensions of empirical process theory, which were shown to be the “right” notions for online learnability. 
Two topics, however, have remained elusive thus far: obtaining data-dependent bounds and establishing 
model selection (or, oracle-type) inequalities for online learning problems. In this paper we develop new 
techniques for addressing both these topics. 

Oracle inequalities and model selection have been topics of intense research in statistics in the last two
decades [1, 2, 3]. Given a sequence of models $\mathcal{M}_1, \mathcal{M}_2, \ldots$ whose union is $\mathcal{M}$, one aims to derive a procedure
that selects, given an i.i.d. sample of size $n$, an estimator $\hat{f}$ from a model $\mathcal{M}_{\hat{m}}$ that trades off bias and
variance. Roughly speaking the desired oracle bound takes the form

$$\mathrm{err}(\hat{f}) \le \inf_{m} \Big\{ \inf_{f \in \mathcal{M}_m} \mathrm{err}(f) + \mathrm{pen}_n(m) \Big\},$$

where $\mathrm{pen}_n(m)$ is a penalty for the model $m$. Such oracle inequalities are attractive because they can be
shown to hold even if the overall model $\mathcal{M}$ is too large. A central idea in the proofs of such statements (and
an idea that will appear throughout the present paper) is that $\mathrm{pen}_n(m)$ should be “slightly larger” than
the fluctuations of the empirical process for the model $m$. It is therefore not surprising that concentration
inequalities—and particularly Talagrand’s celebrated inequality for the supremum of the empirical process— 
have played an important role in attaining oracle bounds. In order to select a good model in a data-driven 
manner, one first establishes non-asymptotic data-dependent bounds on the fluctuations of an empirical 
process indexed by elements in each model (see the monograph [4]). 

Lifting the ideas of oracle inequalities and data-dependent bounds from statistical to online learning is 
not an obvious task. For one, there is no concentration inequality available, even for the simple case of a 
sequential Rademacher complexity. (For the reader already familiar with this complexity: a change of the
value of one Rademacher variable results in a change of the remaining path, and hence an attempt to use a 
version of a bounded difference inequality grossly fails). Luckily, as we show in this paper, the concentration 
machinery is not needed and one only requires a one-sided tail inequality. This realization is motivated by 
the recent work of [5, 6, 7]. At a high level, our approach will be to develop one-sided inequalities for the
suprema of certain offset processes [7], with an offset that is chosen to be “slightly larger” than the complexity
of the corresponding model. We then show that these offset processes also determine which data-dependent 
adaptive rates are achievable for a given online learning problem, drawing strong connections to the ideas of 
statistical learning described earlier. 


1.1 Framework 

Let $\mathcal{X}$ be the set of observations, $\mathcal{D}$ the space of decisions, and $\mathcal{Y}$ the set of outcomes. Let $\Delta(S)$ denote the
set of distributions on a set $S$. Let $\ell : \mathcal{D} \times \mathcal{Y} \to \mathbb{R}$ be a loss function. The online learning framework is defined
by the following process: For $t = 1, \ldots, n$, Nature provides input instance $x_t \in \mathcal{X}$; Learner selects prediction
distribution $q_t \in \Delta(\mathcal{D})$; Nature provides label $y_t \in \mathcal{Y}$, while the learner draws prediction $\hat{y}_t \sim q_t$ and suffers
loss $\ell(\hat{y}_t, y_t)$.

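Two specific scenarios of interest are supervised learning ($\mathcal{Y} \subseteq \mathbb{R}$, $\mathcal{D} \subseteq \mathbb{R}$) and online linear (or convex)
optimization ($\mathcal{X} = \{0\}$ is the singleton set, $\mathcal{Y}$ and $\mathcal{D}$ are unit balls in dual Banach spaces and $\ell(\hat{y}, y) = \langle \hat{y}, y \rangle$).
For a class $\mathcal{F} \subseteq \mathcal{D}^{\mathcal{X}}$, we define the learner’s cumulative regret to $\mathcal{F}$ as

$$\sum_{t=1}^{n} \ell(\hat{y}_t, y_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{n} \ell(f(x_t), y_t).$$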
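The protocol and the regret definition above can be sketched in a few lines. This is a minimal illustration, not code from the paper: the names `play_online_game`, `regret`, `learner`, and the representation of $q_t$ as a list of (decision, probability) pairs are all assumptions made here, and the function class is taken to be finite so that the infimum is a minimum.

```python
import random

def play_online_game(xs, ys, learner, loss, rng):
    """Run the protocol for t = 1..n and return the learner's cumulative loss.

    learner(history, x_t) returns the distribution q_t as a list of
    (decision, probability) pairs; history holds the past (x_s, y_s) pairs.
    """
    history, total = [], 0.0
    for x_t, y_t in zip(xs, ys):
        q_t = learner(history, x_t)              # Learner commits to q_t first
        decisions = [d for d, _ in q_t]
        probs = [p for _, p in q_t]
        y_hat = rng.choices(decisions, weights=probs, k=1)[0]  # draw y_hat ~ q_t
        total += loss(y_hat, y_t)                # loss is charged once y_t is revealed
        history.append((x_t, y_t))
    return total

def regret(total_loss, xs, ys, function_class, loss):
    """Cumulative loss minus the cumulative loss of the best fixed f (finite class)."""
    best = min(sum(loss(f(x), y) for x, y in zip(xs, ys)) for f in function_class)
    return total_loss - best
```

For instance, a learner that always plays 0 under absolute loss against outcomes (1, 1, 0) suffers loss 2, while the best of the two constant predictors 0 and 1 suffers loss 1, so the regret is 1.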

A uniform regret bound $B_n$ is achievable if there exists a randomized algorithm for selecting $\hat{y}_t$ such that

$$\mathbb{E}\left[ \sum_{t=1}^{n} \ell(\hat{y}_t, y_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{n} \ell(f(x_t), y_t) \right] \le B_n \qquad \forall x_{1:n}, y_{1:n}, \tag{1}$$

where $a_{1:n}$ stands for $\{a_1, \ldots, a_n\}$. Achievable rates depend on the complexity of the function class $\mathcal{F}$. For
example, the sequential Rademacher complexity of $\mathcal{F}$ is one of the tightest achievable uniform rates for a variety
of loss functions [8, 7].

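An adaptive regret bound has the form $B_n(f; x_{1:n}, y_{1:n})$ and is said to be achievable if there exists a
randomized algorithm for selecting $\hat{y}_t$ such that

$$\mathbb{E}\left[ \sum_{t=1}^{n} \ell(\hat{y}_t, y_t) - \sum_{t=1}^{n} \ell(f(x_t), y_t) \right] \le B_n(f; x_{1:n}, y_{1:n}) \qquad \forall x_{1:n}, y_{1:n},\ \forall f \in \mathcal{F}. \tag{2}$$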


We distinguish three types of adaptive bounds, according to whether $B_n(f; x_{1:n}, y_{1:n})$ depends only on
$f$, only on $(x_{1:n}, y_{1:n})$, or on both quantities. Whenever $B_n$ depends on $f$, an adaptive regret bound can be viewed
as an oracle inequality which penalizes each $f$ according to a measure of its complexity (e.g. the complexity
of the smallest model to which it belongs). As in statistical learning, an oracle inequality (2) may be proved
for certain functions $B_n(f; x_{1:n}, y_{1:n})$ even if a uniform bound (1) cannot hold for any nontrivial $B_n$.


1.2 Related Work 

The case when $B_n(f; x_{1:n}, y_{1:n}) = B_n(x_{1:n}, y_{1:n})$ does not depend on $f$ has received most of the attention in
the literature. The focus is on bounds that can be tighter for “nice sequences,” yet maintain near-optimal 
worst-case guarantees. An incomplete list of prior work includes [9, 10, 11, 12], couched in the setting of 
online linear/convex optimization, and [13] in the experts setting. 

The present paper was partly motivated by the work of [14] who presented an algorithm that competes 
with all experts simultaneously, but with varied regret with respect to each of them, depending on the quantile 
of the expert. This is a bound of the type $B_n(f)$ (dependent only on $f$, where $f$ denotes the quantile we
compete against) for the finite experts setting. The work of [15] considers online linear optimization with an
unbounded set and provides oracle inequalities with an appropriately chosen function $B_n(f)$.

Finally, the third category of adaptive bounds are those that depend on both the hypothesis $f \in \mathcal{F}$ and
the data. The bounds that depend on the loss of the best function (so-called “small-loss” bounds, [16, Sec.
2.4], [17, 13]) fall in this category trivially, since one may overbound the loss of the best function by the
performance of $f$. We would like to draw attention to the recent result of [18] who show an adaptive bound
in terms of both the loss of comparator and the KL divergence between the comparator and some pre-fixed 
prior distribution over experts. An MDL-style bound in terms of the variance of the loss of the comparator 
(under the distribution induced by the algorithm) was recently given in [19]. 

Our study was also partly inspired by Cover [20] who characterized necessary and sufficient conditions 
for achievable bounds in prediction of binary sequences. The methods in [20], however, rely on the structure 
of the binary prediction problem and do not readily generalize to other settings. 

The framework we propose recovers the vast majority of known adaptive rates in the literature, including
variance bounds, quantile bounds, localization-based bounds, and fast rates for small losses. It should be 
noted that while existing literature on adaptive online learning has focused on simple hypothesis classes 
such as finite experts and finite-dimensional p-norm balls, our results extend to general hypothesis classes, 
including large nonparametric ones discussed in [7]. 


2 Adaptive Rates and Achievability: General Setup 

The first step in building a general theory for adaptive online learning is to identify what adaptive regret 
bounds are possible to achieve. Recall that an adaptive regret bound $B_n : \mathcal{F} \times \mathcal{X}^n \times \mathcal{Y}^n \to \mathbb{R}$ is said to
be achievable if there exists an online learning algorithm that produces predictions/decisions such that (2) 
holds. 

In the rest of this work, we use the notation $\langle\!\langle \ldots \rangle\!\rangle_{t=1}^{n}$ to denote the interleaved application of the operators
inside the brackets, repeated over $t = 1, \ldots, n$ rounds (see [21]). Achievability of an adaptive rate can be
formalized by the following minimax quantity.

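Definition 1. Given an adaptive rate $B_n$ we define the offset minimax value:

$$\mathcal{A}_n(\mathcal{F}, B_n) = \Big\langle\!\!\Big\langle \sup_{x_t \in \mathcal{X}} \inf_{q_t \in \Delta(\mathcal{D})} \sup_{y_t \in \mathcal{Y}} \mathbb{E}_{\hat{y}_t \sim q_t} \Big\rangle\!\!\Big\rangle_{t=1}^{n} \left[ \sum_{t=1}^{n} \ell(\hat{y}_t, y_t) - \inf_{f \in \mathcal{F}} \Big\{ \sum_{t=1}^{n} \ell(f(x_t), y_t) + B_n(f; x_{1:n}, y_{1:n}) \Big\} \right].$$

$\mathcal{A}_n(\mathcal{F}, B_n)$ quantifies how $\sum_{t=1}^{n} \ell(\hat{y}_t, y_t) - \inf_{f \in \mathcal{F}} \big\{ \sum_{t=1}^{n} \ell(f(x_t), y_t) + B_n(f; x_{1:n}, y_{1:n}) \big\}$ behaves when the
optimal learning algorithm that minimizes this difference is used against Nature trying to maximize it.
Directly from this definition,

An adaptive rate $B_n$ is achievable if and only if $\mathcal{A}_n(\mathcal{F}, B_n) \le 0$.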

If $B_n$ is a uniform rate, i.e., $B_n(f; x_{1:n}, y_{1:n}) = B_n$, achievability reduces to the minimax analysis explored in
[8]. The uniform rate $B_n$ is achievable if and only if $B_n \ge \mathcal{V}_n(\mathcal{F})$, where $\mathcal{V}_n(\mathcal{F})$ is the minimax value of the
online learning game.

We now focus on understanding the minimax value $\mathcal{A}_n(\mathcal{F}, B_n)$ for general adaptive rates. We first show
that the minimax value is bounded by an offset version of the sequential Rademacher complexity studied 
in [8]. The symmetrization Lemma 1 below provides us with the first step towards a probabilistic analysis 
of achievable rates. Before stating the lemma, we need to define the notion of a tree and the notion of 
sequential Rademacher complexity. 

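Given a set $\mathcal{Z}$, a $\mathcal{Z}$-valued tree $\mathbf{z}$ of depth $n$ is a sequence $(\mathbf{z}_t)_{t=1}^{n}$ of functions $\mathbf{z}_t : \{\pm 1\}^{t-1} \to \mathcal{Z}$. One may
view $\mathbf{z}$ as a complete binary tree decorated by elements of $\mathcal{Z}$. Let $\epsilon = (\epsilon_t)_{t=1}^{n}$ be a sequence of independent
Rademacher random variables. Then $(\mathbf{z}_t(\epsilon))$ may be viewed as a predictable process with respect to the
filtration $\mathcal{S}_t = \sigma(\epsilon_1, \ldots, \epsilon_t)$. For a tree $\mathbf{z}$, the sequential Rademacher complexity of a function class $\mathcal{G} \subseteq \mathbb{R}^{\mathcal{Z}}$
on $\mathbf{z}$ is defined as

$$\mathcal{R}_n(\mathcal{G}, \mathbf{z}) = \mathbb{E}_{\epsilon} \sup_{g \in \mathcal{G}} \sum_{t=1}^{n} \epsilon_t\, g(\mathbf{z}_t(\epsilon)),$$

and we denote $\mathcal{R}_n(\mathcal{G}) = \sup_{\mathbf{z}} \mathcal{R}_n(\mathcal{G}, \mathbf{z})$. Let $\mathbf{z}_{1:n}(\epsilon) = (\mathbf{z}_1(\epsilon), \ldots, \mathbf{z}_n(\epsilon))$ be the labels of the tree $\mathbf{z}$ along the
path given by $\epsilon$.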
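For a small depth and a finite class, the definition of $\mathcal{R}_n(\mathcal{G}, \mathbf{z})$ can be evaluated exactly by enumerating all $2^n$ sign paths. The sketch below is only an illustration of the definition: the function name and the encoding of a tree as a callable on sign prefixes are assumptions made here, and the cost grows exponentially in $n$.

```python
from itertools import product

def sequential_rademacher(function_class, tree, n):
    """Exact E_eps sup_g sum_t eps_t * g(z_t(eps)) for a finite class, by enumeration.

    tree(prefix) returns the label z_t(eps) for t = len(prefix) + 1, where prefix
    is the tuple (eps_1, ..., eps_{t-1}); the label at depth t never sees eps_t.
    """
    total = 0.0
    for eps in product((-1, 1), repeat=n):
        labels = [tree(eps[:t]) for t in range(n)]       # z_1(eps), ..., z_n(eps)
        total += max(sum(e * g(z) for e, z in zip(eps, labels))
                     for g in function_class)
    return total / 2 ** n
```

With the two-element class $\{z \mapsto z,\ z \mapsto -z\}$ and a tree whose every node carries the label 1, the supremum on each path is $|\epsilon_1 + \cdots + \epsilon_n|$, so for $n = 2$ the value is 1. Because the label at depth $t$ may depend on $\epsilon_1, \ldots, \epsilon_{t-1}$, the same routine also handles genuinely path-dependent trees.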

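Lemma 1. For any lower semi-continuous loss $\ell$, and any adaptive rate $B_n$ that only depends on outcomes
(i.e. $B_n(f; x_{1:n}, y_{1:n}) = B_n(y_{1:n})$), we have that

$$\mathcal{A}_n \le \sup_{\mathbf{x}, \mathbf{y}} \mathbb{E}_{\epsilon} \left[ \sup_{f \in \mathcal{F}} \Big\{ 2 \sum_{t=1}^{n} \epsilon_t\, \ell(f(\mathbf{x}_t(\epsilon)), \mathbf{y}_t(\epsilon)) \Big\} - B_n(\mathbf{y}_{1:n}(\epsilon)) \right]. \tag{3}$$

Further, for any general adaptive rate $B_n$,

$$\mathcal{A}_n \le \sup_{\mathbf{x}, \mathbf{y}, \mathbf{y}'} \mathbb{E}_{\epsilon} \left[ \sup_{f \in \mathcal{F}} \Big\{ 2 \sum_{t=1}^{n} \epsilon_t\, \ell(f(\mathbf{x}_t(\epsilon)), \mathbf{y}_t(\epsilon)) - B_n(f; \mathbf{x}_{1:n}(\epsilon), \mathbf{y}'_{2:n+1}(\epsilon)) \Big\} \right]. \tag{4}$$

Finally, if one considers the supervised learning problem where $\mathcal{F} : \mathcal{X} \to \mathbb{R}$, $\mathcal{Y} \subset \mathbb{R}$ and $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ is a loss
that is convex and $L$-Lipschitz in its first argument, then for any adaptive rate $B_n$,

$$\mathcal{A}_n \le \sup_{\mathbf{x}, \mathbf{y}} \mathbb{E}_{\epsilon} \left[ \sup_{f \in \mathcal{F}} \Big\{ 2L \sum_{t=1}^{n} \epsilon_t\, f(\mathbf{x}_t(\epsilon)) - B_n(f; \mathbf{x}_{1:n}(\epsilon), \mathbf{y}_{1:n}(\epsilon)) \Big\} \right]. \tag{5}$$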


The above lemma tells us that to check whether an adaptive rate is achievable, it is sufficient to check 
that the corresponding adaptive sequential complexity measures are non-positive. We remark that if the 
above complexities are bounded by some positive quantity of a smaller order, one can form a new achievable
rate $B_n'$ by adding the positive quantity to $B_n$.


3 Probabilistic Tools 

As mentioned in the introduction, our technique rests on certain one-sided probabilistic inequalities. We 
now state the first building block: a rather straightforward maximal inequality. 

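Proposition 2. Let $I = \{1, \ldots, N\}$, $N \le \infty$, be a set of indices and let $(X_i)_{i \in I}$ be a sequence of random
variables satisfying the following tail condition: for any $\tau > 0$,

$$P(X_i - B_i > \tau) \le C_1 \exp\big(-\tau^2 / (2\sigma_i^2)\big) + C_2 \exp(-\tau s_i) \tag{6}$$

for some positive sequence $(B_i)$, nonnegative sequence $(\sigma_i)$ and nonnegative sequence $(s_i)$ of numbers, and
for constants $C_1, C_2 \ge 0$. Then for any $\bar{\sigma} \le \sigma_1$, $\bar{s} \ge s_1$, and

$$\theta_i = \max\left\{ \frac{\sigma_i}{B_i} \sqrt{2 \log(\sigma_i / \bar{\sigma}) + 4 \log(i)},\ (B_i s_i)^{-1} \log\big(i^2 (\bar{s} / s_i)\big) \right\} + 1$$

it holds that

$$\mathbb{E} \sup_{i \in I} \{ X_i - B_i \theta_i \} \le 3 C_1 \bar{\sigma} + 2 C_2 (\bar{s})^{-1}. \tag{7}$$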

We remark that Bi need not be the expected value of Xi, as we are not interested in two-sided deviations 
around the mean. 
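To see where the dilation $\theta_i$ comes from, here is a short derivation sketch of (7) from the tail condition (6). It is a reconstruction for illustration, not the authors' proof. It assumes $\theta_i = \max\{\tfrac{\sigma_i}{B_i}\sqrt{2\log(\sigma_i/\bar{\sigma}) + 4\log i},\ (B_i s_i)^{-1}\log(i^2 \bar{s}/s_i)\} + 1$, that $\sigma_i, s_i > 0$, and that both logarithms are nonnegative; the shorthand $u_i = B_i(\theta_i - 1)$ is introduced here.

```latex
\begin{align*}
P(X_i - B_i\theta_i > \tau)
  &= P(X_i - B_i > \tau + u_i)
   \le C_1 e^{-(\tau+u_i)^2/(2\sigma_i^2)} + C_2 e^{-(\tau+u_i)s_i} \\
  &\le C_1 e^{-u_i^2/(2\sigma_i^2)}\, e^{-\tau^2/(2\sigma_i^2)} + C_2 e^{-u_i s_i}\, e^{-\tau s_i}
   \le C_1 \frac{\bar\sigma}{\sigma_i\, i^2}\, e^{-\tau^2/(2\sigma_i^2)}
     + C_2 \frac{s_i}{\bar s\, i^2}\, e^{-\tau s_i}, \\
\mathbb{E}\sup_{i\in I}\{X_i - B_i\theta_i\}
  &\le \sum_{i\in I}\int_0^\infty P(X_i - B_i\theta_i > \tau)\,d\tau
   \le \sum_{i\ge 1}\frac{1}{i^2}\Big(\sqrt{\pi/2}\;C_1\bar\sigma + \frac{C_2}{\bar s}\Big) \\
  &= \frac{\pi^2}{6}\Big(\sqrt{\pi/2}\;C_1\bar\sigma + \frac{C_2}{\bar s}\Big)
   \le 3C_1\bar\sigma + 2C_2(\bar s)^{-1}.
\end{align*}
```

The first term of the maximum is what makes the subgaussian contributions summable over $i$, and the second term does the same for the subexponential contributions; the final step uses $\tfrac{\pi^2}{6}\sqrt{\pi/2} \approx 2.06 \le 3$ and $\tfrac{\pi^2}{6} \approx 1.64 \le 2$.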

One of the approaches to obtaining oracle-type inequalities is to split a large class into smaller ones 
according to a “complexity radius” and control a certain stochastic process separately on each subset (also 
known as the peeling technique). In the applications below, $X_i$ will often stand for the (random) supremum
of this process, and $B_i$ will be an upper bound on its typical size. Given deviation bounds for $X_i$ above
$B_i$, the dilated size $B_i \theta_i$ then allows one to pass to maximal inequalities (7) and thus verify achievability in
Lemma 1. The same strategy works for obtaining data-dependent bounds, where we first prove tail bounds
for the given size of the data-dependent quantity, and then appeal to (7).

A simple yet powerful example for the control of the supremum of a stochastic process is an inequality 
due to Pinelis [22] for the norm (which can be written as a supremum over the dual ball) of a martingale in 
a 2-smooth Banach space. Here we state a version of this result that can be found in [23, Appendix A]. 

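Lemma 3. Let $\mathcal{Z}$ be a unit ball in a separable $(2, D)$-smooth Banach space $\mathcal{H}$. Then for any $\mathcal{Z}$-valued tree
$\mathbf{z}$,

$$P\left( \Big\| \sum_{t=1}^{n} \epsilon_t \mathbf{z}_t(\epsilon) \Big\| \ge \tau \right) \le 2 \exp\left( -\frac{\tau^2}{8 D^2 n} \right)$$

whenever $n > \tau / 4D^2$.

When the class of functions is not linear, we may no longer appeal to the above lemma. Instead, we make
use of the following result from [24] that extends Lemma 3 at a price of a poly-logarithmic factor. Before
stating the lemma, we briefly define the relevant complexity measures (see [24] for more details). First, a set
$V$ of $\mathbb{R}$-valued trees is called an $\alpha$-cover of $\mathcal{G} \subseteq \mathbb{R}^{\mathcal{Z}}$ on $\mathbf{z}$ with respect to $\ell_p$ if

$$\forall g \in \mathcal{G},\ \forall \epsilon \in \{\pm 1\}^n,\ \exists \mathbf{v} \in V \ \text{s.t.}\ \sum_{t=1}^{n} \big| g(\mathbf{z}_t(\epsilon)) - \mathbf{v}_t(\epsilon) \big|^p \le n \alpha^p.$$

The size of the smallest $\alpha$-cover is denoted by $\mathcal{N}_p(\mathcal{G}, \alpha, \mathbf{z})$, and $\mathcal{N}_p(\mathcal{G}, \alpha, n) = \sup_{\mathbf{z}} \mathcal{N}_p(\mathcal{G}, \alpha, \mathbf{z})$.

The set $V$ is an $\alpha$-cover of $\mathcal{G}$ on $\mathbf{z}$ with respect to $\ell_\infty$ if

$$\forall g \in \mathcal{G},\ \forall \epsilon \in \{\pm 1\}^n,\ \exists \mathbf{v} \in V \ \text{s.t.}\ \big| g(\mathbf{z}_t(\epsilon)) - \mathbf{v}_t(\epsilon) \big| \le \alpha \quad \forall t \in [n].$$

We let $\mathcal{N}_\infty(\mathcal{G}, \alpha, \mathbf{z})$ be the size of the smallest such cover and set $\mathcal{N}_\infty(\mathcal{G}, \alpha, n) = \sup_{\mathbf{z}} \mathcal{N}_\infty(\mathcal{G}, \alpha, \mathbf{z})$.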

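Lemma 4 ([24]). Let $\mathcal{G} \subseteq [-1, 1]^{\mathcal{Z}}$. Suppose $\mathcal{R}_n(\mathcal{G})/n \to 0$ with $n \to \infty$ and that the following mild assumptions
hold: $\mathcal{R}_n(\mathcal{G}) \ge 1/n$, $\mathcal{N}_\infty(\mathcal{G}, 2^{-1}, n) \ge 4$, and there exists a constant $\Gamma$ such that $\Gamma \ge \sum_{j=1}^{\infty} \mathcal{N}_\infty(\mathcal{G}, 2^{-j}, n)^{-1}$.

Then for any $\theta > \sqrt{12/n}$, for any $\mathcal{Z}$-valued tree $\mathbf{z}$ of depth $n$,

$$\begin{aligned}
& P\left( \sup_{g \in \mathcal{G}} \Big| \sum_{t=1}^{n} \epsilon_t\, g(\mathbf{z}_t(\epsilon)) \Big| > 8 \Big( 1 + \theta \sqrt{8 n \log^3(e n^2)} \Big) \cdot \mathcal{R}_n(\mathcal{G}) \right) \\
&\quad \le P\left( \sup_{g \in \mathcal{G}} \Big| \sum_{t=1}^{n} \epsilon_t\, g(\mathbf{z}_t(\epsilon)) \Big| > n \inf_{\alpha > 0} \Big\{ 4\alpha + 6\theta \int_{\alpha}^{1} \sqrt{\log \mathcal{N}_\infty(\mathcal{G}, \delta, n)}\, d\delta \Big\} \right) \le 2 \Gamma e^{-n\theta^2/4}.
\end{aligned}$$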


The above lemma yields a one-sided control on the size of the supremum of the sequential Rademacher 
process, as required for our oracle-type inequalities. 

Next, we turn our attention to an offset Rademacher process, where the supremum is taken over a 
collection of negative-mean random variables. The behavior of this offset process was shown to govern 
the optimal rates of convergence for online nonparametric regression [7]. Such a one-sided control of the 
supremum will be necessary for some of the data-dependent upper bounds we develop. 

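Lemma 5. Let $\mathbf{z}$ be a $\mathcal{Z}$-valued tree of depth $n$, and let $\mathcal{G} \subseteq [-1, 1]^{\mathcal{Z}}$. For any $\gamma \ge 1/n$ and $\alpha > 0$,

$$P\left( \sup_{g \in \mathcal{G}} \sum_{t=1}^{n} \Big( \epsilon_t\, g(\mathbf{z}_t(\epsilon)) - 2\alpha\, g^2(\mathbf{z}_t(\epsilon)) \Big) - \frac{\log \mathcal{N}_2(\mathcal{G}, \gamma, \mathbf{z})}{\alpha} - 12\sqrt{2} \int_{1/n}^{\gamma} \sqrt{n \log \mathcal{N}_2(\mathcal{G}, \delta, \mathbf{z})}\, d\delta - 1 > \tau \right) \le \Gamma \exp\left( -\frac{\tau^2}{2\sigma^2} \right) + \exp\left( -\frac{\alpha \tau}{2} \right),$$

where $\Gamma \ge \sum_{j=1}^{\log_2(2n\gamma)} \mathcal{N}_2(\mathcal{G}, 2^{-j}\gamma, \mathbf{z})^{-2}$ and $\sigma = 12 \int_{1/n}^{\gamma} \sqrt{n \log \mathcal{N}_2(\mathcal{G}, \delta, \mathbf{z})}\, d\delta$.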
We observe that the probability of deviation has both subgaussian and subexponential components.
Using the above result and Proposition 2 leads to useful bounds on the quantities in Lemma 1 for
specific types of adaptive rates. Given a tree $\mathbf{z}$, we obtain a bound on the expected size of the sequential
Rademacher process when we subtract off the data-dependent $\ell_2$-norm of the function on the tree $\mathbf{z}$, adjusted
by logarithmic terms.

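Corollary 6. Suppose $\mathcal{G} \subseteq [-1, 1]^{\mathcal{Z}}$, and let $\mathbf{z}$ be any $\mathcal{Z}$-valued tree of depth $n$. Assume $\log \mathcal{N}_2(\mathcal{G}, \delta, n) \le \delta^{-p}$
for some $p < 2$. Then

$$\mathbb{E} \left[ \sup_{g \in \mathcal{G},\, \gamma} \left\{ \sum_{t=1}^{n} \epsilon_t\, g(\mathbf{z}_t(\epsilon)) - 4 \sqrt{ 2 (\log n) \log \mathcal{N}_2(\mathcal{G}, \gamma/2, \mathbf{z}) \Big( \sum_{t=1}^{n} g^2(\mathbf{z}_t(\epsilon)) + 1 \Big) } - 24\sqrt{2} \log n \int_{1/n}^{\gamma} \sqrt{n \log \mathcal{N}_2(\mathcal{G}, \delta, \mathbf{z})}\, d\delta \right\} \right]$$

is at most $7 + 2 \log n$.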

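The next corollary yields slightly faster rates than Corollary 6 when $|\mathcal{G}| < \infty$.

Corollary 7. Suppose $\mathcal{G} \subseteq [-1, 1]^{\mathcal{Z}}$ with $|\mathcal{G}| = N$, and let $\mathbf{z}$ be any $\mathcal{Z}$-valued tree of depth $n$. Then

$$\mathbb{E} \left[ \sup_{g \in \mathcal{G}} \left\{ \sum_{t=1}^{n} \epsilon_t\, g(\mathbf{z}_t(\epsilon)) - 2 \sqrt{ \log\Big( \log N \sum_{t=1}^{n} g^2(\mathbf{z}_t(\epsilon)) + e \Big) \cdot 32 \Big( \log N \sum_{t=1}^{n} g^2(\mathbf{z}_t(\epsilon)) + e \Big) } \right\} \right] \le 1.$$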


4 Achievable Bounds 

In this section we use Lemma 1 along with the probabilistic tools from the previous section to obtain an 
array of achievable adaptive bounds for various online learning problems. We subdivide the section into one 
subsection for each category of adaptive bound described in Section 1.1. 


4.1 Adapting to Data 

Here we consider adaptive rates of the form $B_n(x_{1:n}, y_{1:n})$ or $B_n(y_{1:n})$, uniform over $f \in \mathcal{F}$. We show the
power of the developed tools on the following example.


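Example 4.1 (Online Linear Optimization in $\mathbb{R}^d$). Consider the problem of online linear optimization
where $\mathcal{F} = \{f : \|f\|_2 \le 1\}$, $\mathcal{Y} = \{y : \|y\|_2 \le 1\}$, $\mathcal{X} = \{0\}$, and $\ell(\hat{y}, y) = \langle \hat{y}, y \rangle$. The following adaptive rate
is achievable:

$$B_n(y_{1:n}) = 16 \sqrt{d} \log(n) \left\| \Big( \sum_{t=1}^{n} y_t y_t^{\top} \Big)^{1/2} \right\|_{\sigma} + 16 \sqrt{d} \log(n),$$

where $\|\cdot\|_{\sigma}$ is the spectral norm. Let us deduce this result from Corollary 6. First, observe that

$$\left\| \Big( \sum_{t=1}^{n} y_t y_t^{\top} \Big)^{1/2} \right\|_{\sigma} = \left\| \sum_{t=1}^{n} y_t y_t^{\top} \right\|_{\sigma}^{1/2} = \sup_{f : \|f\|_2 \le 1} \sqrt{ f^{\top} \Big( \sum_{t=1}^{n} y_t y_t^{\top} \Big) f } = \sup_{f : \|f\|_2 \le 1} \sqrt{ \sum_{t=1}^{n} \langle f, y_t \rangle^2 }.$$

The linear function class $\mathcal{F}$ can be covered point-wise at any scale $\delta$ with $(3/\delta)^d$ balls and thus

$$\mathcal{N}_2(\ell \circ \mathcal{F}, 1/(2n), \mathbf{z}) \le (6n)^d$$

for any $\mathcal{Y}$-valued tree $\mathbf{z}$. We apply Corollary 6 with $\gamma = 1/n$ (the integral vanishes) to conclude the claimed
statement.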
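To get a feel for the size of this rate on a concrete sequence, the sketch below evaluates $16\sqrt{d}\log(n)\,\|(\sum_t y_t y_t^{\top})^{1/2}\|_{\sigma} + 16\sqrt{d}\log(n)$, which is the rate as read from Example 4.1. It is an illustration only: the function names are hypothetical, and the top eigenvalue is approximated by plain power iteration from a fixed start vector, which assumes that vector is not orthogonal to the top eigenvector.

```python
import math

def top_eigenvalue(matrix, iters=500):
    """Approximate largest eigenvalue of a symmetric PSD matrix by power iteration."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d                 # fixed unit start vector (assumption)
    for _ in range(iters):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:                          # zero matrix (or v in the null space)
            return 0.0
        v = [x / norm for x in w]
    sv = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
    return sum(a * b for a, b in zip(v, sv))     # Rayleigh quotient v^T S v

def spectral_rate(ys):
    """Evaluate the adaptive rate of Example 4.1 on outcomes ys (vectors in R^d)."""
    n, d = len(ys), len(ys[0])
    gram = [[sum(y[i] * y[j] for y in ys) for j in range(d)] for i in range(d)]
    root_norm = math.sqrt(max(top_eigenvalue(gram), 0.0))   # ||(sum y y^T)^{1/2}||_sigma
    scale = 16.0 * math.sqrt(d) * math.log(n)
    return scale * root_norm + scale
```

Since $\|y_t\|_2 \le 1$, the spectral term is at most $\sqrt{n}$, and it is much smaller when the outcomes spread their energy over many directions rather than concentrating on one.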
4.2 Model Adaptation

In this subsection we focus on achievable rates for oracle inequalities and model selection, but without
dependence on data. The form of the rate is therefore $B_n(f)$. Assume we have a class $\mathcal{F} = \bigcup_{R \ge 1} \mathcal{F}(R)$ with
the property that $\mathcal{F}(R) \subseteq \mathcal{F}(R')$ for any $R \le R'$. If we are told by an oracle that regret will be measured with
respect to those hypotheses $f \in \mathcal{F}$ with $R(f) \triangleq \inf\{R : f \in \mathcal{F}(R)\} \le R^*$, then using the minimax algorithm
one can guarantee a regret bound of at most the sequential Rademacher complexity $\mathcal{R}_n(\mathcal{F}(R^*))$. On the
other hand, given the optimality of the sequential Rademacher complexity for online learning problems for
commonly encountered losses, we can argue that for any $f \in \mathcal{F}$ chosen in hindsight, one cannot expect a
regret better than order $\mathcal{R}_n(\mathcal{F}(R(f)))$. In this section we show that simultaneously for all $f \in \mathcal{F}$, one can
attain an adaptive upper bound of $O\big( \mathcal{R}_n(\mathcal{F}(R(f))) \sqrt{\log(\mathcal{R}_n(\mathcal{F}(R(f))))}\, \log^{3/2} n \big)$. That is, we may predict
as if we knew the optimal radius, at the price of a logarithmic factor. This is the price of adaptation.

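Corollary 8. For any class of predictors $\mathcal{F}$ with $\mathcal{F}(1)$ non-empty, if one considers the supervised learning
problem with 1-Lipschitz loss $\ell$, the following rate is achievable:

$$B_n(f) = K_1 \mathcal{R}_n(\mathcal{F}(2R(f))) \log^{3/2} n \left( 1 + \sqrt{ \log\left( \frac{\log(2R(f)) \cdot \mathcal{R}_n(\mathcal{F}(2R(f)))}{\mathcal{R}_n(\mathcal{F}(1))} \right) } \right) + K_2 \Gamma\, \mathcal{R}_n(\mathcal{F}(1)) \log^{3/2} n,$$

for absolute constants $K_1, K_2$, and $\Gamma$ defined in Lemma 4.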

In fact, this statement is true more generally with $\mathcal{F}(2R(f))$ replaced by $\ell \circ \mathcal{F}(2R(f))$.

It is tempting to attempt to prove the above statement with the exponential weights algorithm running 
as an aggregation procedure over the solutions for each R. In general, this approach will fail for two 
reasons. First, if function values grow with R, the exponential weights bound will scale linearly with this 
value. Second, an experts bound yields a rate which spoils any faster rates one may obtain using offset 
Rademacher complexities. 

As a special case of the above corollary, we obtain an online PAC-Bayesian theorem for infinite classes of
experts. However, we postpone this example to the next sub-section where we get a data-dependent version
of this result. Neither of these bounds appears to be available in the literature, to the best of our knowledge.

We now provide a bound for online linear optimization in 2-smooth Banach spaces that automatically 
adapts to the norm of the comparator. To prove it, we use the concentration bound from [22] (Lemma 3) 
within the proof of the above corollary to remove the extra logarithmic factors. 

Example 4.2 (Unconstrained Linear Optimization). Consider linear optimization with $\mathcal{Y}$ being the unit
ball of some reflexive Banach space with norm $\|\cdot\|_*$. Let $\mathcal{F} = \mathcal{D}$ be the dual space and the loss $\ell(\hat{y}, y) = \langle \hat{y}, y \rangle$
(where we are using $\langle \cdot, \cdot \rangle$ to represent the linear functional in the first argument applied to the second argument).
Define $\mathcal{F}(R) = \{f \mid \|f\| \le R\}$ where $\|\cdot\|$ is the norm dual to $\|\cdot\|_*$. If the unit ball of $\mathcal{Y}$ is $(2, D)$-smooth, then
the following rate is achievable for all $f$ with $\|f\| \ge 1$:

$$B(f) = D \sqrt{n} \Big( 8 \|f\| \Big( 1 + \sqrt{ \log(2\|f\|) + \log\log(2\|f\|) } \Big) + 12 \Big).$$

For the case of a Hilbert space, the above bound was achieved by [15].


4.3 Adapting to Data and Model Simultaneously

We now study achievable bounds that perform online model selection in a data-adaptive way. Of specific 
interest is the example of an online optimistic PAC-Bayesian bound which, in contrast to earlier results, does
not have dependence on the number of experts, and so holds for countably infinite sets of experts. The
bound simultaneously adapts to the loss of the mixture of experts. This example subsumes and improves
upon the recent results from [18, 14] and provides an exact analogue to the PAC-Bayesian theorem from
statistical learning. Further, quantile experts bounds can be easily recovered from the result. 
Example 4.3 (Generalized Predictable Sequences (Supervised Learning)). Consider an online
supervised learning problem with a convex 1-Lipschitz loss. Let $(M_t)_{t \ge 1}$ be any predictable sequence that the
learner can compute at round $t$ based on information provided so far, including $x_t$. (One can think of the
predictable sequence $M_t$ as a prior guess for the hypothesis we would compare with in hindsight.) Then the
following adaptive rate is achievable:

$$B_n(f; x_{1:n}) = \inf_{\gamma} \left\{ K_1 \sqrt{ \log n \cdot \log \mathcal{N}_2(\mathcal{F}, \gamma/2, n) \cdot \Big( \sum_{t=1}^{n} (f(x_t) - M_t)^2 + 1 \Big) } + K_2 \log n \int_{1/n}^{\gamma} \sqrt{n \log \mathcal{N}_2(\mathcal{F}, \delta, n)}\, d\delta + 2 \log n + 7 \right\},$$

for constants $K_1 = 4\sqrt{2}$, $K_2 = 24\sqrt{2}$ from Corollary 6. The achievability is a direct consequence of Eq. (5)
in Lemma 1, followed by Corollary 6 (one can include any predictable sequence in the Rademacher average
part because $\sum_t \epsilon_t M_t$ is zero mean). Particularly, if we assume that the sequential covering of class $\mathcal{F}$ grows
as $\log \mathcal{N}_2(\mathcal{F}, \epsilon, n) \le \epsilon^{-p}$ for some $p < 2$, we get that

$$B_n(f) = \tilde{O}\left( \left( \sqrt{ \sum_{t=1}^{n} (f(x_t) - M_t)^2 + 1 } \right)^{1 - p/2} \big( \sqrt{n} \big)^{p/2} \right).$$

As $p$ gets closer to 0, we get full adaptivity and replace $n$ by $\sum_{t=1}^{n} (f(x_t) - M_t)^2 + 1$. On the other hand, as
$p$ gets closer to 2 (i.e. more complex function classes), we do not adapt and get a uniform bound in terms
of $n$. For $p \in (0, 2)$, we attain a natural interpolation.
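The interpolation can be checked by balancing the two terms inside the infimum over $\gamma$. The calculation below is a worked sketch rather than something taken from the source: it writes $S = \sum_{t=1}^{n} (f(x_t) - M_t)^2 + 1$ (a shorthand introduced here), drops constants and logarithmic factors, extends the lower limit of the integral to 0, and assumes $\log \mathcal{N}_2(\mathcal{F}, \delta, n) \le \delta^{-p}$ with $p < 2$.

```latex
\begin{align*}
\sqrt{\gamma^{-p} S} + \int_0^{\gamma}\sqrt{n\,\delta^{-p}}\,d\delta
  &= \gamma^{-p/2}\sqrt{S} + \frac{\sqrt{n}\,\gamma^{1-p/2}}{1-p/2}, \\
\gamma^{-p/2}\sqrt{S} = \sqrt{n}\,\gamma^{1-p/2}
  &\iff \gamma = \sqrt{S/n}, \\
\gamma^{-p/2}\sqrt{S}\,\Big|_{\gamma=\sqrt{S/n}}
  &= S^{\frac12-\frac p4}\, n^{\frac p4}
   = \big(\sqrt{S}\big)^{1-p/2}\big(\sqrt{n}\big)^{p/2}.
\end{align*}
```

At $p \to 0$ the right-hand side becomes $\sqrt{S}$ and at $p \to 2$ it becomes $\sqrt{n}$, matching the two extremes described in the example.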

Example 4.4 (Regret to Fixed vs Regret to Best (Supervised Learning)). Consider an online
supervised learning problem with a convex 1-Lipschitz loss and let $|\mathcal{F}| = N$. Let $f^* \in \mathcal{F}$ be a fixed expert
chosen in advance. The following bound is achievable:

$$B_n(f, x_{1:n}) = 4 \sqrt{ \log\Big( \log N \sum_{t=1}^{n} (f(x_t) - f^*(x_t))^2 + e \Big) \cdot 32 \Big( \log N \sum_{t=1}^{n} (f(x_t) - f^*(x_t))^2 + e \Big) } + 2.$$

In particular, against $f^*$ we have

$$B_n(f^*, x_{1:n}) = O(1),$$

and against an arbitrary expert we have

$$B_n(f, x_{1:n}) = O\Big( \sqrt{ n \log N (\log n + \log\log N) } \Big).$$

This bound follows directly from Eq. (5) in Lemma 1 followed by Corollary 7. This extends the study of [25]
to supervised learning and a general class of experts $\mathcal{F}$.

Example 4.5 (Optimistic PAC-Bayes). Assume that we have a countable set of experts and that the
loss for each expert on any round is non-negative and bounded by 1. The function class $\mathcal{F}$ is the set of all
distributions over these experts, and $\mathcal{X} = \{0\}$. This setting can be formulated as online linear optimization
where the loss of mixture $f$ over experts, given instance $y$, is $\langle f, y \rangle$, the expected loss under the mixture. The
following adaptive bound is achievable:

$$B_n(f; y_{1:n}) = \sqrt{ 50 \big( \mathrm{KL}(f \mid \pi) + \log(n) \big) \sum_{t=1}^{n} \mathbb{E}_{i \sim f} \langle e_i, y_t \rangle^2 } + 50 \big( \mathrm{KL}(f \mid \pi) + \log(n) \big) + 10.$$

This adaptive bound is an online PAC-Bayesian bound. The rate adapts not only to the KL divergence
of $f$ with fixed prior $\pi$ but also replaces $n$ with $\sum_{t=1}^{n} \mathbb{E}_{i \sim f} \langle e_i, y_t \rangle^2$. Note that we have $\sum_{t=1}^{n} \mathbb{E}_{i \sim f} \langle e_i, y_t \rangle^2 \le
\sum_{t=1}^{n} \langle f, y_t \rangle$, yielding the small-loss type bound described earlier. This is an improvement over the bound in
[18] in that the bound is independent of the number of experts, and thus holds even for countably infinite sets
of experts. The KL term in our bound may be compared to the MDL-style term in the bound of [19]. If we
have a large (but finite) number of experts and take the uniform distribution $\pi$, the above bound provides an
improvement over both [14] and [18] for quantile bounds for experts. Specifically, if we want quantile bounds
simultaneously for every quantile $\epsilon$ then for any given quantile we can use the uniform distribution over the top
$1/\epsilon$ experts and hence the KL term is replaced by $\log(1/\epsilon)$.

Evaluating the above bound with a distribution $f$ that places all its weight on any one expert appears to
address the open question posed by [13] of obtaining algorithm-independent oracle-type variance bounds for
experts.

The proof of achievability of the above rate is shown in the appendix because it requires a slight variation on
the symmetrization lemma specific to the problem.


5 Relaxations for Adaptive Learning 

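To design algorithms for achievable rates, we extend the framework of online relaxations from [26]. A
relaxation $\mathrm{Rel}_n : \bigcup_{t=0}^{n} \mathcal{X}^t \times \mathcal{Y}^t \to \mathbb{R}$ is admissible for an adaptive rate $B_n$ if $\mathrm{Rel}_n$ satisfies the initial condition

$$\mathrm{Rel}_n(x_{1:n}, y_{1:n}) \ge - \inf_{f \in \mathcal{F}} \Big\{ \sum_{t=1}^{n} \ell(f(x_t), y_t) + B_n(f; x_{1:n}, y_{1:n}) \Big\} \tag{8}$$

and the recursive condition

$$\mathrm{Rel}_n(x_{1:t-1}, y_{1:t-1}) \ge \sup_{x_t \in \mathcal{X}} \inf_{q_t \in \Delta(\mathcal{D})} \sup_{y_t \in \mathcal{Y}} \mathbb{E}_{\hat{y}_t \sim q_t} \big[ \ell(\hat{y}_t, y_t) + \mathrm{Rel}_n(x_{1:t}, y_{1:t}) \big]. \tag{9}$$

The corresponding strategy $q_t = \arg\min_{q_t \in \Delta(\mathcal{D})} \sup_{y_t \in \mathcal{Y}} \mathbb{E}_{\hat{y}_t \sim q_t} \big[ \ell(\hat{y}_t, y_t) + \mathrm{Rel}_n(x_{1:t}, y_{1:t}) \big]$ enjoys the adaptive
bound

$$\sum_{t=1}^{n} \ell(\hat{y}_t, y_t) - \inf_{f \in \mathcal{F}} \Big\{ \sum_{t=1}^{n} \ell(f(x_t), y_t) + B_n(f; x_{1:n}, y_{1:n}) \Big\} \le \mathrm{Rel}_n(\cdot) \qquad \forall x_{1:n}, y_{1:n}.$$

It follows immediately that the strategy achieves the rate $B_n(f; x_{1:n}, y_{1:n}) + \mathrm{Rel}_n(\cdot)$. Our goal is then to find
relaxations for which the strategy is computationally tractable and $\mathrm{Rel}_n(\cdot) \le 0$ or at least has smaller order
than $B_n$. Similar to [26], conditional versions of the offset minimax values $\mathcal{A}_n$ yield admissible relaxations,
but solving these relaxations may not be computationally tractable.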


Example 5.1 (Online PAC-Bayes). Consider the experts setting described in Example 4.5 and an adaptive
bound,

$$B_n(f) = 3 \sqrt{ 2n \max\{ \mathrm{KL}(f \mid \pi), 1 \} } + 4\sqrt{n}.$$

Let $R_i = 2^{i-1}$ for $i \in \mathbb{N}$ and let $q^{R_i}_t(y)$ denote the exponential weights distribution with learning rate $\sqrt{R_i/n}$
given losses $y_{1:t}$: $q^{R_i}(y_{1:t}) \propto \sum_k \pi_k e_k \exp\big( -\sqrt{R_i/n}\, \langle \sum_{s=1}^{t} y_s, e_k \rangle \big)$ (wherein $e_k$ is the $k$th standard basis vector).
The following is an admissible relaxation achieving $B_n$:

$$\mathrm{Rel}_n(y_{1:t}) = \inf_{\lambda > 0} \left[ \frac{1}{\lambda} \log\left( \sum_{i} \exp\left( -\lambda \Big[ \sum_{s=1}^{t} \langle q^{R_i}(y_{1:s-1}), y_s \rangle + \sqrt{n R_i} \Big] \right) \right) + 2\lambda (n - t) \right].$$

To achieve this strategy we maintain a distribution $q^*_t$ with $(q^*_t)_i \propto \exp\big( -\tfrac{1}{\sqrt{n}} \big[ \sum_{s=1}^{t-1} \langle q^{R_i}(y_{1:s-1}), y_s \rangle + \sqrt{n R_i} \big] \big)$.
We predict by drawing $i$ according to $q^*_t$, then drawing an expert according to $q^{R_i}(y_{1:t-1})$.

This algorithm can be interpreted as running a “low-level” instance of the exponential weights algorithm
for each complexity radius $R_i$, then combining the predictions of these algorithms with a “high-level” instance.
The high-level distribution $q^*_t$ differs slightly from the usual exponential weights distribution in that
it incorporates a prior whose weight decreases as the complexity radius increases. The prior distribution
prevents the strategy from incurring a penalty that depends on the range of values the complexity radii take
on, which would happen if the standard exponential weights distribution were used.
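The two-level structure described in this example can be sketched as follows. This is an illustration under stated assumptions, not the paper's algorithm verbatim: the radii are truncated to a finite list of length `num_radii`, the learning rates $\sqrt{R_i/n}$ (low level) and $1/\sqrt{n}$ (high level) and the prior term $\sqrt{n R_i}$ are the values as read from the example, and all function names are hypothetical.

```python
import math

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def low_level(prior, cum_losses, radius, n):
    """Exponential weights over experts with prior pi and learning rate sqrt(R / n)."""
    eta = math.sqrt(radius / n)
    return normalize([p * math.exp(-eta * c) for p, c in zip(prior, cum_losses)])

def two_level_predict(prior, losses, n, num_radii):
    """Mixture over experts played in the next round.

    losses: past loss vectors y_1, ..., y_{t-1}, entries in [0, 1].
    Returns the marginal distribution of the expert obtained by first drawing a
    radius index i from the high-level weights, then an expert from instance i.
    """
    radii = [2.0 ** i for i in range(num_radii)]       # R_i = 2^(i-1), i = 1..num_radii
    cum = [0.0] * len(prior)                           # cumulative loss of each expert
    low_cum = [0.0] * num_radii                        # expected loss of each low-level instance
    for y in losses:
        for i, r in enumerate(radii):
            q = low_level(prior, cum, r, n)            # uses only y_1..y_{s-1}
            low_cum[i] += sum(qk * yk for qk, yk in zip(q, y))
        cum = [c + yk for c, yk in zip(cum, y)]
    # High-level weights: the sqrt(n * R_i) term acts as a prior decaying in R_i.
    high = normalize([math.exp(-(low_cum[i] + math.sqrt(n * r)) / math.sqrt(n))
                      for i, r in enumerate(radii)])
    lows = [low_level(prior, cum, r, n) for r in radii]
    return [sum(high[i] * lows[i][j] for i in range(num_radii))
            for j in range(len(prior))]
```

Before any losses are observed every low-level instance plays the prior, so the returned mixture equals the prior; as losses accumulate, instances with larger radii react more aggressively while the high-level weights discount them by $\exp(-\sqrt{R_i})$.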


While in general the problem of obtaining an efficient adaptive relaxation might be hard, one can ask
the question, “If an efficient relaxation $\mathrm{Rel}_n^R$ is available for each $\mathcal{F}(R)$, can one obtain an adaptive model
selection algorithm for all of $\mathcal{F}$?” To this end, for the supervised learning problem with convex Lipschitz loss we
delineate a meta approach which utilizes existing relaxations for each $\mathcal{F}(R)$ to obtain an algorithm for general
adaptation.

Lemma 9. Let $q^R(y_1, \ldots, y_{t-1})$ be the randomized strategy corresponding to $\mathrm{Rel}_n^R$, obtained after observing
outcomes $y_1, \ldots, y_{t-1}$, and let $\theta : \mathbb{R} \to \mathbb{R}$ be nonnegative. The following relaxation is admissible for the rate

$B_n(R) = \mathrm{Rel}_n^R(\cdot)\, \theta(\mathrm{Rel}_n^R(\cdot))$:


Ada„ (lilt, 1/1:0 = 


sup Ee 


sup 

x,y,y' R>1 


Rel^(xi:t,i/i:t)-Rel^(-)6l(Rel^(-))+ 2 ^ 

S = t + 1 


_^(e))[eiysiR),ysie))] 


Playing according to the strategy for $\mathrm{Ada}_n$ will guarantee a regret bound of the form $B_n(R) + \mathrm{Ada}_n(\cdot)$,
and $\mathrm{Ada}_n(\cdot)$ can be bounded using Proposition 2 when the form of $\theta$ is as in that proposition.

We remark that the above strategy is not necessarily obtained by running a high-level experts algorithm 
over the discretized values of R. It is an interesting question to determine the cases when such a strategy 
is optimal. More generally, whenever the adaptive rate depends on data, it is not possible to obtain 
the rates we show non-constructively in this paper using some form of exponential weights algorithms using 
meta-experts as the required weighting over experts would be data dependent (and hence is not a prior 
over experts). Further, the bounds from exponential-weights-type algorithms are more akin to having sub-exponential
tails in Proposition 2, but for many problems we might have sub-gaussian tails.

Obtaining computationally efficient methods from the proposed framework is an interesting research 
direction. Proposition 2 provides a useful non-constructive tool to establish achievable adaptive bounds, and 
a natural question to ask is if one can obtain a constructive counterpart for the proposition. 
A Appendix

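Proof of Lemma 1. We first prove Eq. (3) and (4). We start from the definition of $\mathcal{A}_n(\mathcal{F})$. Our proof
proceeds “inside out” by starting with the $n$-th term and then working backwards by repeatedly applying the
minimax theorem. To this end, along similar lines as in [24, 7, 21], we start with the innermost term:

$$\begin{aligned}
& \sup_{x_n \in \mathcal{X}} \inf_{q_n \in \Delta(\mathcal{D})} \sup_{y_n \in \mathcal{Y}} \left( \mathbb{E}_{\hat{y}_n \sim q_n} \left[ \sum_{t=1}^{n} \ell(\hat{y}_t, y_t) - \inf_{f \in \mathcal{F}} \Big\{ \sum_{t=1}^{n} \ell(f(x_t), y_t) + B_n(f; x_{1:n}, y_{1:n}) \Big\} \right] \right) \\
&= \sup_{x_n \in \mathcal{X}} \inf_{q_n \in \Delta(\mathcal{D})} \sup_{p_n \in \Delta(\mathcal{Y})} \left( \mathbb{E}_{\hat{y}_n \sim q_n,\, y_n \sim p_n} \left[ \sum_{t=1}^{n} \ell(\hat{y}_t, y_t) - \inf_{f \in \mathcal{F}} \Big\{ \sum_{t=1}^{n} \ell(f(x_t), y_t) + B_n(f; x_{1:n}, y_{1:n}) \Big\} \right] \right) \\
&= \sup_{x_n \in \mathcal{X}} \sup_{p_n \in \Delta(\mathcal{Y})} \inf_{q_n \in \Delta(\mathcal{D})} \left( \mathbb{E}_{\hat{y}_n \sim q_n,\, y_n \sim p_n} \left[ \sum_{t=1}^{n} \ell(\hat{y}_t, y_t) - \inf_{f \in \mathcal{F}} \Big\{ \sum_{t=1}^{n} \ell(f(x_t), y_t) + B_n(f; x_{1:n}, y_{1:n}) \Big\} \right] \right) \\
&= \sup_{x_n \in \mathcal{X}} \sup_{p_n \in \Delta(\mathcal{Y})} \inf_{\hat{y}_n \in \mathcal{D}} \left( \mathbb{E}_{y_n \sim p_n} \left[ \sum_{t=1}^{n} \ell(\hat{y}_t, y_t) - \inf_{f \in \mathcal{F}} \Big\{ \sum_{t=1}^{n} \ell(f(x_t), y_t) + B_n(f; x_{1:n}, y_{1:n}) \Big\} \right] \right) \\
&= \sup_{x_n \in \mathcal{X}} \sup_{p_n \in \Delta(\mathcal{Y})} \left( \mathbb{E}_{y_n \sim p_n} \left[ \sup_{f \in \mathcal{F}} \Big\{ \inf_{\hat{y}_n \in \mathcal{D}} \mathbb{E}_{y_n \sim p_n} \big[ \ell(\hat{y}_n, y_n) \big] + \sum_{t=1}^{n-1} \ell(\hat{y}_t, y_t) - \sum_{t=1}^{n} \ell(f(x_t), y_t) - B_n(f; x_{1:n}, y_{1:n}) \Big\} \right] \right).
\end{aligned}$$

To apply the minimax theorem in step 3 above, we note that the term in the round bracket is linear in $q_n$
and in $p_n$ (as it is an expectation). Hence under mild assumptions on the sets $\mathcal{D}$ and $\mathcal{Y}$, the losses, and
the adaptive rate $B_n$, one can apply a generalized version of the minimax theorem to swap $\sup_{p_n}$ and $\inf_{q_n}$.
Compactness of the sets and lower semi-continuity of the losses and $B_n$ are sufficient, but see [24, 21] for
milder conditions. Proceeding backward from $n$ to $1$ in a similar fashion we end up with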


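\[
\begin{aligned}
\mathcal{A}_n(\mathcal{F})
&=\Big\langle\!\!\Big\langle\sup_{x_t\in\mathcal{X}}\inf_{q_t\in\Delta(\mathcal{D})}\sup_{y_t\in\mathcal{Y}}\mathbb{E}_{\hat y_t\sim q_t}\Big\rangle\!\!\Big\rangle_{t=1}^{n}\Big[\sum_{t=1}^{n}\ell(\hat y_t,y_t)-\inf_{f\in\mathcal{F}}\Big\{\sum_{t=1}^{n}\ell(f(x_t),y_t)+B_n(f;x_{1:n},y_{1:n})\Big\}\Big]\\
&=\Big\langle\!\!\Big\langle\sup_{x_t\in\mathcal{X}}\sup_{p_t\in\Delta(\mathcal{Y})}\mathbb{E}_{y_t\sim p_t}\Big\rangle\!\!\Big\rangle_{t=1}^{n}\Big[\sup_{f\in\mathcal{F}}\Big\{\sum_{t=1}^{n}\inf_{\hat y_t\in\mathcal{D}}\mathbb{E}_{y_t\sim p_t}[\ell(\hat y_t,y_t)]-\sum_{t=1}^{n}\ell(f(x_t),y_t)-B_n(f;x_{1:n},y_{1:n})\Big\}\Big]\\
&\le\Big\langle\!\!\Big\langle\sup_{x_t\in\mathcal{X}}\sup_{p_t\in\Delta(\mathcal{Y})}\mathbb{E}_{y_t\sim p_t}\Big\rangle\!\!\Big\rangle_{t=1}^{n}\Big[\sup_{f\in\mathcal{F}}\Big\{\sum_{t=1}^{n}\mathbb{E}_{y'_t\sim p_t}[\ell(f(x_t),y'_t)]-\ell(f(x_t),y_t)-B_n(f;x_{1:n},y_{1:n})\Big\}\Big].
\end{aligned}
\tag{10}
\]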


See [21] for more details of the steps involved in obtaining the above equality. From this point on we split the proof for Equations 3 and 4. To prove the bound in Equation 3, note that $B_n(f;x_{1:n},y_{1:n}) = B_n(y_{1:n})$, and so (this proof is similar in spirit to the one in [7])


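\[
\begin{aligned}
\mathcal{A}_n(\mathcal{F})
&\le\Big\langle\!\!\Big\langle\sup_{x_t\in\mathcal{X}}\sup_{p_t\in\Delta(\mathcal{Y})}\mathbb{E}_{y_t\sim p_t}\Big\rangle\!\!\Big\rangle_{t=1}^{n}\Big[\sup_{f\in\mathcal{F}}\Big\{\sum_{t=1}^{n}\mathbb{E}_{y'_t\sim p_t}[\ell(f(x_t),y'_t)]-\ell(f(x_t),y_t)\Big\}-B_n(y_{1:n})\Big]\\
&=\Big\langle\!\!\Big\langle\sup_{x_t\in\mathcal{X}}\sup_{p_t\in\Delta(\mathcal{Y})}\mathbb{E}_{y_t\sim p_t}\Big\rangle\!\!\Big\rangle_{t=1}^{n}\Big[\sup_{f\in\mathcal{F}}\Big\{\sum_{t=1}^{n}\mathbb{E}_{y'_t\sim p_t}[\ell(f(x_t),y'_t)]-\ell(f(x_t),y_t)\Big\}-\tfrac12 B_n(y_{1:n})-\tfrac12 B_n(y_{1:n})\Big].
\end{aligned}
\]

Using linearity of expectation repeatedly (since $B_n$ is independent of $f$ and the $x_t$'s),

\[
\mathcal{A}_n(\mathcal{F})\le\Big\langle\!\!\Big\langle\sup_{x_t\in\mathcal{X}}\sup_{p_t\in\Delta(\mathcal{Y})}\mathbb{E}_{y_t\sim p_t}\Big\rangle\!\!\Big\rangle_{t=1}^{n}\Big[\sup_{f\in\mathcal{F}}\Big\{\sum_{t=1}^{n}\mathbb{E}_{y'_t\sim p_t}[\ell(f(x_t),y'_t)]-\ell(f(x_t),y_t)\Big\}-\tfrac12 B_n(y_{1:n})-\tfrac12\mathbb{E}_{y'_{1:n}}[B_n(y'_{1:n})]\Big].
\]

By Jensen's inequality, we pull out the expectations w.r.t. the $y'_t$'s to further upper bound the above quantity by

\[
\begin{aligned}
&\Big\langle\!\!\Big\langle\sup_{x_t\in\mathcal{X}}\sup_{p_t\in\Delta(\mathcal{Y})}\mathbb{E}_{y_t,y'_t\sim p_t}\Big\rangle\!\!\Big\rangle_{t=1}^{n}\Big[\sup_{f\in\mathcal{F}}\Big\{\sum_{t=1}^{n}\ell(f(x_t),y'_t)-\ell(f(x_t),y_t)\Big\}-\tfrac12 B_n(y_{1:n})-\tfrac12 B_n(y'_{1:n})\Big]\\
&=\Big\langle\!\!\Big\langle\sup_{x_t\in\mathcal{X}}\sup_{p_t\in\Delta(\mathcal{Y})}\mathbb{E}_{y_t,y'_t\sim p_t}\mathbb{E}_{\epsilon_t}\Big\rangle\!\!\Big\rangle_{t=1}^{n}\Big[\sup_{f\in\mathcal{F}}\Big\{\sum_{t=1}^{n}\epsilon_t\big(\ell(f(x_t),y'_t)-\ell(f(x_t),y_t)\big)\Big\}-\tfrac12 B_n(y_{1:n})-\tfrac12 B_n(y'_{1:n})\Big]\\
&\le\Big\langle\!\!\Big\langle\sup_{x_t\in\mathcal{X}}\sup_{y_t,y'_t\in\mathcal{Y}}\mathbb{E}_{\epsilon_t}\Big\rangle\!\!\Big\rangle_{t=1}^{n}\Big[\sup_{f\in\mathcal{F}}\Big\{\sum_{t=1}^{n}\epsilon_t\big(\ell(f(x_t),y'_t)-\ell(f(x_t),y_t)\big)\Big\}-\tfrac12 B_n(y_{1:n})-\tfrac12 B_n(y'_{1:n})\Big]\\
&\le\Big\langle\!\!\Big\langle\sup_{x_t\in\mathcal{X}}\sup_{y_t\in\mathcal{Y}}\mathbb{E}_{\epsilon_t}\Big\rangle\!\!\Big\rangle_{t=1}^{n}\Big[\sup_{f\in\mathcal{F}}\Big\{\sum_{t=1}^{n}2\epsilon_t\,\ell(f(x_t),y_t)\Big\}-B_n(y_{1:n})\Big]\\
&=\sup_{\mathbf{x},\mathbf{y}}\mathbb{E}_{\epsilon}\Big[\sup_{f\in\mathcal{F}}\Big\{2\sum_{t=1}^{n}\epsilon_t\,\ell(f(\mathbf{x}_t(\epsilon)),\mathbf{y}_t(\epsilon))\Big\}-B_n(\mathbf{y}_{1:n}(\epsilon))\Big],
\end{aligned}
\]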

where the last but one step is by sub-additivity of supremum and linearity of expectation and last step is by 
skolemizing the supremum interleaved with average w.r.t. Rademacher random variables in the binary tree 
format. 
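The skolemization step can be checked numerically on a toy instance. The sketch below is illustrative only: the three-function class `FUNCS`, the instance set `XS`, and the objective `phi` are hypothetical choices, not objects from the paper. It compares the interleaved quantity $\sup_{x_1}\mathbb{E}_{\epsilon_1}\cdots\sup_{x_n}\mathbb{E}_{\epsilon_n}\phi$ with the supremum over depth-$n$ binary trees of $\mathbb{E}_\epsilon\,\phi$ evaluated along the path selected by $\epsilon$.

```python
import itertools

SIGNS = (-1, 1)

def interleaved_value(phi, xs, n):
    """sup_{x_1} E_{e_1} ... sup_{x_n} E_{e_n} phi(x_{1:n}, e_{1:n}), by recursion."""
    def rec(x_prefix, e_prefix):
        if len(x_prefix) == n:
            return phi(x_prefix, e_prefix)
        # x_t is chosen knowing e_{1:t-1}; e_t is then averaged out.
        return max(
            0.5 * sum(rec(x_prefix + (x,), e_prefix + (e,)) for e in SIGNS)
            for x in xs
        )
    return rec((), ())

def tree_value(phi, xs, n):
    """sup over depth-n trees of E_e phi(x_1, x_2(e_1), ..., x_n(e_{1:n-1}), e)."""
    # A tree labels every sign prefix of length 0, ..., n-1 with an element of xs.
    nodes = [p for t in range(n) for p in itertools.product(SIGNS, repeat=t)]
    paths = list(itertools.product(SIGNS, repeat=n))
    best = float("-inf")
    for labels in itertools.product(xs, repeat=len(nodes)):
        tree = dict(zip(nodes, labels))
        avg = sum(
            phi(tuple(tree[e[:t]] for t in range(n)), e) for e in paths
        ) / len(paths)
        best = max(best, avg)
    return best

# Hypothetical toy instance: a three-element function class on three points.
FUNCS = (lambda x: x, lambda x: -x, lambda x: x * x - 1.0)
XS = (-1.0, 0.5, 2.0)

def phi(x_path, e_path):
    # sup_f 2 * sum_t e_t f(x_t), mimicking the symmetrized term.
    return max(sum(2 * e * f(x) for x, e in zip(x_path, e_path)) for f in FUNCS)

if __name__ == "__main__":
    print(interleaved_value(phi, XS, 3), tree_value(phi, XS, 3))
```

The two values coincide because a tree is exactly a choice of $x_t$ as a function of $\epsilon_{1:t-1}$, which is the information available to the interleaved supremum at step $t$.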


We now move to proving Eq. (4). We start from Eq. (10): 


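\[
\mathcal{A}_n(\mathcal{F})\le\Big\langle\!\!\Big\langle\sup_{x_t\in\mathcal{X}}\sup_{p_t\in\Delta(\mathcal{Y})}\mathbb{E}_{y_t\sim p_t}\Big\rangle\!\!\Big\rangle_{t=1}^{n}\Big[\sup_{f\in\mathcal{F}}\Big\{\sum_{t=1}^{n}\mathbb{E}_{y'_t\sim p_t}[\ell(f(x_t),y'_t)]-\ell(f(x_t),y_t)-B_n(f;x_{1:n},y_{1:n})\Big\}\Big].
\]

Using Jensen's inequality to pull out the expectations w.r.t. the $y'_t$'s, we get

\[
\begin{aligned}
\mathcal{A}_n(\mathcal{F})
&\le\Big\langle\!\!\Big\langle\sup_{x_t\in\mathcal{X}}\sup_{p_t\in\Delta(\mathcal{Y})}\mathbb{E}_{y_t,y'_t\sim p_t}\Big\rangle\!\!\Big\rangle_{t=1}^{n}\Big[\sup_{f\in\mathcal{F}}\Big\{\sum_{t=1}^{n}\ell(f(x_t),y'_t)-\ell(f(x_t),y_t)-B_n(f;x_{1:n},y_{1:n})\Big\}\Big]\\
&\le\Big\langle\!\!\Big\langle\sup_{x_t\in\mathcal{X}}\sup_{p_t\in\Delta(\mathcal{Y})}\mathbb{E}_{y_t,y'_t\sim p_t}\sup_{y''_t\in\mathcal{Y}}\Big\rangle\!\!\Big\rangle_{t=1}^{n}\Big[\sup_{f\in\mathcal{F}}\Big\{\sum_{t=1}^{n}\ell(f(x_t),y'_t)-\ell(f(x_t),y_t)-B_n(f;x_{1:n},y''_{1:n})\Big\}\Big]\\
&=\Big\langle\!\!\Big\langle\sup_{x_t\in\mathcal{X}}\sup_{p_t\in\Delta(\mathcal{Y})}\mathbb{E}_{y_t,y'_t\sim p_t}\mathbb{E}_{\epsilon_t}\sup_{y''_t\in\mathcal{Y}}\Big\rangle\!\!\Big\rangle_{t=1}^{n}\Big[\sup_{f\in\mathcal{F}}\Big\{\sum_{t=1}^{n}\epsilon_t\big(\ell(f(x_t),y'_t)-\ell(f(x_t),y_t)\big)-B_n(f;x_{1:n},y''_{1:n})\Big\}\Big]\\
&\le\Big\langle\!\!\Big\langle\sup_{x_t\in\mathcal{X}}\sup_{y_t,y'_t\in\mathcal{Y}}\mathbb{E}_{\epsilon_t}\sup_{y''_t\in\mathcal{Y}}\Big\rangle\!\!\Big\rangle_{t=1}^{n}\Big[\sup_{f\in\mathcal{F}}\Big\{\sum_{t=1}^{n}\epsilon_t\big(\ell(f(x_t),y'_t)-\ell(f(x_t),y_t)\big)-B_n(f;x_{1:n},y''_{1:n})\Big\}\Big]\\
&\le\Big\langle\!\!\Big\langle\sup_{x_t\in\mathcal{X}}\sup_{y_t\in\mathcal{Y}}\mathbb{E}_{\epsilon_t}\sup_{y''_t\in\mathcal{Y}}\Big\rangle\!\!\Big\rangle_{t=1}^{n}\Big[\sup_{f\in\mathcal{F}}\Big\{\sum_{t=1}^{n}2\epsilon_t\,\ell(f(x_t),y_t)-B_n(f;x_{1:n},y''_{1:n})\Big\}\Big]\\
&=\sup_{\mathbf{x},\mathbf{y},\mathbf{y}'}\mathbb{E}_{\epsilon}\Big[\sup_{f\in\mathcal{F}}\Big\{2\sum_{t=1}^{n}\epsilon_t\,\ell(f(\mathbf{x}_t(\epsilon)),\mathbf{y}_t(\epsilon))-B_n(f;\mathbf{x}_{1:n}(\epsilon),\mathbf{y}'_{2:n+1}(\epsilon))\Big\}\Big],
\end{aligned}
\]

where in the last step we switch to tree notation, but keep in mind that each $y''_t$ is picked after drawing $\epsilon_t$, and thus the tree $\mathbf{y}'$ appears with one index shifted.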


Finally, we proceed to prove inequality (5). Here, we employ the convexity assumption $\ell(\hat y_t,y_t)-\ell(f(x_t),y_t)\le\ell'(\hat y_t,y_t)(\hat y_t-f(x_t))$, where the derivative is with respect to the first argument. As before, applying the minimax theorem,


= (( sup inf sup _E 

\xtiX 5 teA('D) Vt^y 


= (( sup sup inf E 
\xtiXptiA(y)vti'D yt~Pt 


t=l 

n r 


T,^(yt,yt) - inf \ + Bn{f;xi-.n,yi-.n) 

t=i U=i 


i + ^"(/i^nn, 2 /nn) 

(^iLt=l U=l 


< (( sup sup inf E 

\xt<iXptiA(y)vtiV vt~Pt 


t=i 


s^p\'E^'iyt,yt)iyt - /(a;*)) -Sn(/;a;n„,yn„) 
U=i 


We may now pick $\hat y_t = y_t^*(p_t) = \arg\min_{\hat y\in\mathcal{D}}\mathbb{E}_{y_t\sim p_t}[\ell(\hat y,y_t)]$. By convexity (and assuming the loss allows swapping of derivative and expectation), $\mathbb{E}_{y_t\sim p_t}[\ell'(y_t^*,y_t)] = 0$. This (sub)optimal strategy yields an upper bound of

sup sup E 

XtiX ptiA(y)Vt~Ptll 

Since $\big(\ell'(y_t^*,y_t)-\mathbb{E}_{y'_t\sim p_t}[\ell'(y_t^*,y'_t)]\big)$ is independent of $f$ and has expected value of 0, the above quantity is equal to


sup] ^ , 2 /t) -Ey;~Pt [i'iyhy't)]) (Vt - - Bnif;xi,n,yi,^) 


sup sup E 

xtiXptiA{y)yc~Ptll 

< (( sup sup E 


sup] ^ (Ep;~pj [i'iyt,yt)]-i'iyt,yt))f{xt)-Bn{f;xi,n,yi-.n) 
U=i 


xt<iXp^iA{y) yt^y't-ptH U=i 


sup] XI {y*t ^y't) - iilt ^Vt)) f{xt) - Br,{f-,Xi,n,yv.u) 


= (( sup sup E Ee 


sup ] X] {i'ift ^y't)- ^'{y*t . yt)) f{xt) - Bn{f-, Xi,n,yi-.n) 


\xtiXp^fiA{y)yt,Vt~pt II ft=i 
Replacing $\big(\ell'(y_t^*,y'_t)-\ell'(y_t^*,y_t)\big)$ by $2Ls_t$ for $s_t\in[-1,1]$ and taking the supremum over $s_t$ we get,


< (( sup sup E sup Eg 


t=i 


\Y,‘2L€tStf{xt) -Br 


sup ^ 
U=i 


Pt^A{y) yt^Vt^Pt ste[-l,l] 

^ r f n 

sup ] XI ‘^LetStfixt) - S„(/; xi.,n,yi,n) 
U=i 


(/? Xi-,m yi:n) 


< (i sup sup sup Egj 

\xtiiX yt st6[-l,l] 


Since the suprema over $s_t$ are achieved at $\{\pm1\}$ by convexity, the last expression is equal to


sup sup sup Egj 
xtiX yt Sts{-l,l} 


sup] Yj2LetStf{xt) - Bn{f]Xi,n,yi-.n) 

U=1 


= (t sup sup Egg 
XxtuX yt II t=l 


sup I Y,^Letf{xt) - Bnif;xi-.n,yi-.n) 

U = 1 


sup\'^2Letf{yit{e)) - Bn{f;yii-.n{e),yi-.n{e)) 


= sup Eg _ 

’'■y L/e-7^ U=i 

In the last but one step we removed $s_t$, since for any function $\Psi$ and any $s\in\{\pm1\}$, $\mathbb{E}[\Psi(s\epsilon)] = \tfrac12(\Psi(s)+\Psi(-s)) = \tfrac12(\Psi(1)+\Psi(-1)) = \mathbb{E}[\Psi(\epsilon)]$.

□ 

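Proof of Proposition 2. Define $Z_i = [X_i - B_i\theta_i]_+$. As long as $\theta_i\ge1$, for any strictly positive $\tau$ we have the tail behavior

\[
P(Z_i>\tau) = P(X_i-B_i\theta_i>\tau)\le C_1\exp\Big(-\frac{(B_i(\theta_i-1)+\tau)^2}{2\sigma_i^2}\Big)+C_2\exp\big(-(B_i(\theta_i-1)+\tau)s_i\big).
\]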
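Note that for any positive sequence $(\delta_i)_{i\in I}$ with $\delta = \sum_{i\in I}\delta_i$,

\[
\mathbb{E}\Big[\sup_{i\in I}\,(X_i-B_i\theta_i)\Big]\le\mathbb{E}\Big[\sup_{i\in I}Z_i\Big]\le\sum_{i\in I}\mathbb{E}[Z_i]\le\delta+\sum_{i\in I}\int_{\delta_i}^{\infty}P(Z_i>\tau)\,d\tau.
\]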


The sum of the integrals above is equal to 

^ / P{X,-B,0,>T)dT 

16/ 


u 


exp - 


(5,(0,-1) + t)' 


2a? 


■) 


dt+C2Z [ 


<Ci 


exp (- (5i(0, - 1) + r) Si) dr 

J f' oo 

e~'"‘''dT 

0 


Cl ^cr,exp[-;^ [ —] {Oi - 1)^) + C2 ^s,“^exp(-5iSi (0^ - 1)) 

I.gT \ ^ \ J I ig T 


i^I 

'k‘^\/tt _ 7 r ^ ^ /-\-i 

- ■ 

where the last step is obtained by plugging in

\[
\theta_i = \max\Big\{\frac{\sigma_i}{B_i}\sqrt{2\log(\sigma_i/\bar\sigma)+4\log(i)},\;(B_is_i)^{-1}\log\big(i^2(\bar s/s_i)\big)\Big\}+1
\]

and using as an upper bound $\frac{\sigma_i}{B_i}\sqrt{2\log(i^2\sigma_i/\bar\sigma)}+1$ for $\theta_i$ in the sub-gaussian part and $(B_is_i)^{-1}\log(i^2\bar s/s_i)+1$ for $\theta_i$ in the sub-exponential part. Since $\delta$ can be chosen arbitrarily small, we may over-bound the above constant and obtain the result. □
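As a sanity check on the choice of $\theta_i$, the following Monte Carlo sketch treats the purely sub-gaussian case ($C_1 = 1$, $C_2 = 0$) with independent Gaussians $X_i\sim N(0,\sigma_i^2)$, which satisfy $P(X_i-B_i>\tau)\le\exp(-\tau^2/2\sigma_i^2)$ for any $B_i\ge0$. The Gaussian model, the particular $\sigma_i$ and $B_i$, and the comparison value $3C_1\bar\sigma$ are assumptions made for illustration; the last is the form of constant we take the proposition to give in this case.

```python
import math
import random

def theta_subgaussian(sigmas, Bs, sigma_bar):
    """theta_i = (sigma_i / B_i) * sqrt(2 log(sigma_i / sigma_bar) + 4 log i) + 1, i = 1, 2, ..."""
    return [
        (s / b) * math.sqrt(2.0 * math.log(s / sigma_bar) + 4.0 * math.log(i)) + 1.0
        for i, (s, b) in enumerate(zip(sigmas, Bs), start=1)
    ]

def estimate_offset_sup(sigmas, Bs, thetas, trials=4000, seed=0):
    """Monte Carlo estimate of E[max_i (X_i - B_i * theta_i)] for X_i ~ N(0, sigma_i^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(
            rng.gauss(0.0, s) - b * th for s, b, th in zip(sigmas, Bs, thetas)
        )
    return total / trials

if __name__ == "__main__":
    sigmas = [1.0 + 0.1 * i for i in range(30)]  # hypothetical scales, sigma_1 = 1
    Bs = list(sigmas)                            # hypothetical offsets B_i = sigma_i
    sigma_bar = sigmas[0]                        # sigma_bar <= sigma_i for every i
    thetas = theta_subgaussian(sigmas, Bs, sigma_bar)
    print(estimate_offset_sup(sigmas, Bs, thetas), 3.0 * sigma_bar)
```

The multiplier grows only like $\sqrt{\log i}$, yet the offset process stays bounded in expectation uniformly over the number of indices, which is the point of the proposition.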

Proof of Lemma 5. Fix $\gamma>0$. For $j\ge0$, let $V^j$ be a minimal sequential cover of $\mathcal{G}$ on $\mathbf{z}$ at scale $\beta_j = 2^{-j}\gamma$ and with respect to empirical $\ell_2$ norm. Let $\mathbf{v}^j[g,\epsilon]$ be an element guaranteed to be $\beta_j$-close to $g$ at the $j$-th level, for the given $\epsilon$. Choose $N = \log_2(2\gamma n)$, so that $\beta_Nn\le1$. Let us use the shorthand $\mathcal{N}_2(\gamma) = \mathcal{N}_2(\mathcal{G},\gamma,\mathbf{z})$. For any $\epsilon\in\{\pm1\}^n$ and $g\in\mathcal{G}$,

n 

^etg(zt(e))-2ag(zt(e))^ 

t=l 

can be written as 

n n 

Z i^tigiztie)) - v?[5, e](e))) + Z (etv°[5, e](e) - 2ag(zt(e))2) 

t=l t=l 

n n 

^ Z (et(0(zt(e)) - v?[5,e](e))) + Z (etv°[5, e](e) - av°[g, e](e)^) 


t=l 


t = l 
n N 


= Z i^tigiztie)) - vf [ 5 ,e](e)) + Z Z (vf[5,e](e) - vf ^[5,e](e)) + Z (etv°[g, e](e) - av°[g, e](e)^). 

t=l t=lk=l t=l 

By Cauchy-Schwarz, the first term is upper bounded by $n\beta_N\le1$. The second term above is upper bounded by

N n N n 

Z Ze/(vf[ff,e](e)-vZHffc](e)) ^ Z sup Zetwf(e), 

fc=lt=l fe=l w''6M/fc t=l 

where Wk is a set of differences of trees for levels k and k - 1 (see [24, Proof of Theorem 3]). Finally, the 
third term is controlled by 

n n 

Z (eiV°[g,e](e) - av°[g,e]{e)^) < sup ^ (etVt(e) - av^{e)) . 

t=l V6Vo t = l 
The probability in the statement of the Lemma can now be upper bounded by 


log A /2 ( 7 ) 


I fc=l w'=€W'fc t=l veVot^i Ct 


- Vl\f2 f \Jn logA/2((5)d5 : 

J\jn 


In view of 


N _ 

\/^ Z /3fc\/nlogA/'2(/3fc) < 12\/2 / VudogA/^^dJ 

fc=l 


this probability can be further upper bounded by 

/ N 

Z _ 

U=1 w'=6Wfc t=l V6Vbi = l 

Define a distribution p on {1,..., A/} by 


^(Z Z *^tWt^(£) + sup f] (£iVt(e) - av^^Ce)) - ^ / 3 fc\/nlogA/' 2 (/ 3 fc) > rj . 


Pk = 


/ 3 fc\/nlogA/' 2 (/ 3 fc) 

EZi/3.V«logA4(/3,)' 

Then the above probability can be upper bounded by 


rpk 


p( 3fc 6 [fV] s.t. sup n\ogM2{l3k) > „ 

■w^^Wk t=l ^ 


V supZ(etVt(e)-av2(e))-> I 

veVo t=l Q; 2 


) 

N / n _ \ 

<Z^’( sup ZetWt^(e) - \/^/ 3 fc\/nlogA/' 2 (/ 3 fc) > 

k = l \w''6Wfci = l ^ / 


+ -P(sup Z(etvt(e) -av2(e)) - ^ ^ 

\v6Vottl a 2 

The second term can be upper bounded using Chernoff method by 


) 


Z Z(etVt(e) -av2(e)) - ^°g-^ 2 ( 7 ) ^ ^ j < _;\^ 2 ( 7 ) exp-logA 4 ( 7 )] ^ expj 


veVb \t=l 

while the first sum of probabilities can be upper bounded by 


N 


r^fey/n log A/' 2 (/ 3 fc) 


Z Z P[Z^t^tie)-^Pk^rilogM2iPk)> ^ 

fc=iw'=6rVfc \t=i ^Lk=iPk\/n\ogN2{pk) / 


For any $k$, the tail probability above is controlled by the Hoeffding-Azuma inequality as


/ 


P 


ZDwf(e) > /?fey/nlogA/'2(/3fc) ( 6\/2- 


,2\ 


t=i 


/ 


< exp 


logA/' 2 (/ 3 fc)( 6 \/ 2 - 


2 Zk=i / 3 fc\/nlogA/' 2 (/ 3 fc), 

\ 2 \ 
2Ef=i / 3 fc\/nlogA/' 2 (/ 3 fc), 


/ 


< exp(-41ogA/'2(/3fc))exp 


18(2EZi/3fcV^logAf2(/3fe)) 


( 11 ) 
because ^ 'TJt=i ^ ^y triangle inequality (see [24]). Then the double sum in (11) is upper 

bounded by 


r exp 


18 (2 Zti Pk^n\ogAf2iPk)y ^ 


where T > 


T,k=i-^'iiPk) This upper bound can be further relaxed to 


r exp 


2(12 Ii/n \/n\ogN 2 {d)d 5 ^ ^ 


Since N = log 2 ( 27 n), we may take 

log2(27") 

r= ^ M2 {i 2-^)-^. 

k=l 

□ 


Proof of Corollary 6. Let us write $\mathcal{N}_2(\gamma) = \mathcal{N}_2(\mathcal{G},\gamma,\mathbf{z})$. Observe that

(logn) (logAf 2 ( 7 / 2 )) 




2 (logn) (logAf 2 ( 7 / 2 )) [Y,g^{ztie)) + 1 = 




+ 2a Es (zt(e)) + l 

£ = 1 


)) 


and, furthermore, the optimal a is 


(logn) (log 7 \/' 2 ( 7 / 2 )) 

M 2(Er=iff^(zt(e)) + l) 


which is a number between di = 2 (n+Y) and du = \/ (logn) (logA/ 2 ( 7 / 2 )) as long as ^ 2 ( 7 / 2 ) > 1 . 

With this, we get 


sup 

9^0 t=l 
7e[n”^ ,1] 


\ 


2 (logn) (logA/ 2 ( 7 / 2 )) + ij + 24\/21ogn^^ \/nlogA/ 2 ( 5 )d (5 + 21 og 7 


< sup ^et 3 (zt(e))- 

9^G t=l 

7e[n“^ ,1] ,Q:6[d^ ,du] 


2 (logn) (logAf 2 ( 7 / 2 )) 


- 4a ^ 5 r^(zt(e)) - 24\/21ogn / 7 /nlogA/ 2 ( 5 )ci (5 - 21ogn. 

t=i di/n 


( 12 ) 


The case of $\gamma\in[1/n,2/n)$ will be considered separately. Let us assume $\gamma\ge2/n$. We now discretize both $\alpha$ and $\gamma$ by defining a doubling grid $\alpha_i$ and $\gamma_j = 2^jn^{-1}$, $i,j\ge1$. We go to an upper bound by mapping each $\alpha$ to $\alpha_i$ or $\alpha_i/2$, depending on the direction of the sign. Similarly, we map $\gamma$ to either $\gamma_j$ or $2\gamma_j$. The upper bound becomes

maxsup ^ (etg(zt(e)) - 2aig^(zt(e))) - ( 21 ogn) + 12\/2 f ^ \/nlogA/ 2 ( 5 )d (5 + l) . 

g^e \ Oi Jl/n / 

Given the doubling nature of $\alpha_i$ and $\gamma_j$, the indices $i,j$ are upper bounded by $O(\log n)$. Now define a collection of random variables indexed by $(i,j)$


Wj = sup Y, ei5(zt(e)) - 2aig^(zt(e)) 
g^G t=i 


and constants 




logA/'2(7j) 


+ 12V2 


CXi 
^ \/n logJ\f 2 {S)dd + 1 . 

1/n 




























Lemma 5 establishes that 


P(X, j j > r) < r exp ^ j + exp ^ j 


where cxj = 12\/2 /7^ logAf 2 (S)dS and F as specihed in Lemma 5. Whenever 5-entropy grows as S p, 

CTj < 12\/2^/n, ensuring log(crj/cri) < log(n). Further, we can take 1 < F < log(2n). 

Proposition 2 is used with a sequence of random variables, but we can easily put the pairs (*,j) into a 
vector of size at most log 2 (n)^. Observe that Si = Q;i/2, {BijSi)~^ < 2, ajiBij < 1, si/si < \/2{n + 1). Then, 
by taking a = min{l/F, cti} and s = si, 

= max|^y21og(crj/d-) + 41og(fcij), (B,jSi)“Mog (s/si))| + 1 
< max I y21og(n) + 21og(log(2n)) + 41og(fcij), 2 log \/2{n+l)^ } + 1 

where kij = (logn) • (i - 1) + j. This choice of the multiplier ensures 

Emax{W,j “ jBij} < SFct + < 7 

'^i3 

and Oi j is shown to be upper bounded by 2logn. Hence 


E 


sup X!etfl'(zt(e))-4A 


2 (logn) logA/2(7/2) ( ^ g2(2t(e)) + 1 j - 24 \/ 21 ogn f \Jn logA/2(5)d5 

\t=i / 


- ■ \ 

9Ee,7t=i \ 

Now, consider the case $\gamma\in[1/n,2/n)$. We upper bound (12) by


maxsup y (et5(zt(e)) - 2Q!,5^(zt(e))) - (21ogn) 
* seS t=i 


<7 + 2 logn. 


| logW'2(l/n) ^ 


which is controlled by setting $\gamma = 1/n$ in Lemma 5. This case is completed by invoking Proposition 2 as before. □

Proof of Corollary 7. Assume $N\ge e$ and let $C>0$. We first note that

fciogf^^^^ 

) V “ 


inf-j 

a>0 


) + a(g5^(z,(e)) + j| < 21og(logiv|:g^(z(e)) + e )l cIlogA^lj 


5^(z(e)) +e 


with the inequality obtained using 


a = 


C\ogN 


Sr=i5^(z(e)) +e/logA^’ 
which is a number between de = yj and du - -^^logA^. Subsequently, 


n / n \ / ^ \ 

supX!et£'(zt(e))- 21 og logA^^g^(z(e)) +eK C log Af ^52(-2(g)) + e 
g^Q t=i \ t=i / > \ t=i / 


< sup 

9^0 

ae[d^ ,du 


t=l 

^2/ 


Y- / / Y- 2/ / C\ogN (VClogN 

}_^etg{zt{e))-a}_^g (zt(e))-log' 


Let L = 


l0g2(V'^ + l) + l 


following upper bound holds: 


We discretize the range of a by defining at 


du2 for i e [L], The 


sup 
g^g 
HL] 

Define a collection of random variables indexed by i e [L] with 

w = sup £etg(zt(e)) - ^ (zt(e)) 

g<^Slt=i ^ t=i 

and let Bi = ^ . Applying Lemma 5 with 7 = 1/n establishes 

P{Xi -Bi>T)< exp^-^ j. 

We now set $s_i = \alpha_i/8$ and $\bar s = s_1$, and apply Proposition 2, yielding


^et5(zt(e)) - ^ ^g^(zt(e 
_t=l ^ t=l 


CXi I CXi J 


E{X,-BA}<^^- 

It remains to relate this quantity to the rate we are trying to achieve. Note that our bound on $P(X_i-B_i>\tau)$ has a pure exponential tail, so we only need to consider $\theta_i = (B_is_i)^{-1}\log(i^2(\bar s/s_i))+1$. Taking $C\ge32$ and observing that $(B_is_i)^{-1}\le2$, we obtain

Oi = {BiSiy^\og{f{slsi)) + 1 < 21 og(z^(s/sg)) + 1 = 21 og(i^ 2 *"^) + 1 < 21 og(j^ 2 *) 

4 \ Qfi / 


Finally, we have 


n 


sup 

g^Q 

2 e[L] 


^et 5 (zt(e)) 

.t=l 




321ogiV^^^/N/321ogA^y 

cxi y Oii J 


<E{X,-Byy<^<l. 


□ 


Proof of Corollary 8. We prove the corollary for convex Lipschitz loss, where we remove the loss function using the symmetrization lemma shown earlier. Even if we consider non-convex classes, the loss is readily removed in the step of the proof below where we apply Lemma 4, since the Lipschitz constant is removed when we move to covering numbers. This is a well-known technique, so to keep the proof simple we assume convexity of the loss as well. Our starting point for proving the bounds is Lemma 1, Eq. (4). To show achievability it suffices to show that


E, 


n 

sup Y, - KiTZ„{P{2R{f))) log®^^ n 

t=i 



log 


7^„(^(A(l))) } 


+ log(log(2i?(/))) 


< A2^7^„(J^(l))log®/"n 


where $\Gamma$ is the constant that will be inherited from Lemma 4.

Define $R_i = 2^i$ and note that since the Rademacher complexity of the class $\mathcal{F}(R)$ is non-decreasing with $R$,
sup ^ £t/(xt(e)) - Ki'Jln{T{2R{f))) n I 1 


t=l 


A 




:sup sup ^et/(xt(e))-^^■l7^„(JP(2i^))log®^^n[l + ^ log[^^^^^^-^] + log(log(2R)) I 


<max sup f]£t/(xt(e))-A'l7^„(J*^(R^))log®^^n|l + ^ log(^^^^^^^) + log(log(Ri))|- (13) 


Denote a shorthand C„ = -^96log^(en2) and = Tln{R{Ri))- Now note that by Lemma 4 we have that 
for every i and every 9 > 1, 


sup 

n 

'£etf{xt{e)) 


t=l 


>i{i + eCn)-D^n) <2re- 


-30^ 


Let Xi = lE”=i et/(xt(e))| and let = 8 (1 + Cn) ■ D\. In this case rewriting the above one sided 

tail bound appropriately (with 0=1 + rKSCnDl^)) we see that for any r > 0, 


P(X, - B, 


. 2r / 

> r) < — expl- 


2Hog\en'^)Tll{T{R,)) 


This establishes one-sided sub-gaussian tail behavior. Now applying Proposition 2 and setting $\theta_i$ as suggested by the proposition we conclude that


E, 


/ 


max sup ^et/(xt(e)-^^l7^„(JP(i^,))log^/^n 


1 + 


\| 




iL2^7^„(^(l))log3/"n. 


This concludes the proof by appealing to Eq. (13). □ 

Proof of Achievability for Example 4.2.

Lemma 10. The following bound is achievable in the setting of Example 4.2:

\[
B(f) = D\sqrt{n}\Big(8\|f\|\Big(1+\sqrt{\log(2\|f\|)+\log\log(2\|f\|)}\Big)+12\Big).
\]

This proof specializes the proof of Corollary 8 to the regime where Lemma 3 applies.

Recall our parameterization of $\mathcal{F}$: $\mathcal{F}(R) = \{f\in\mathcal{F}:\|f\|\le R\}$. It was shown in [26] that $C_n(\mathcal{F}(R)) = 2RD\sqrt{n}$ is an upper bound for $\mathcal{R}_n(\mathcal{F}(R))$. We consider the rate


BM) = ‘2Cn{X{2R{f))) 


1 + 



Cn{H2R{m \ 

Cn(X(l)) i 


+ loglog 2 ( 2 i?(/)) 


We begin by applying Lemma 1 (5), yielding 


An < sup Ej sup 
y / 


n 


2E 


et{f,yt{e))-2Cn{H2R{f))) 



Cn{H2R{m \ 


+ log logs (2R(/)) 
We now discretize the range of $R$ via $R_i = 2^i$. By analogy with the proof of Corollary 8 we get the upper bound,


sup Eg sup 

y ieN 


sup 2^e*(/,yt(e))-2C„(^(E,)) 1 + 
t=l I 






= sup Eg sup 

y ieN 


2i?^ 


Ee‘yi(e) 


- 4i:)^/ni?iV^log(i?^) + log(i) 


Fix a $\mathcal{Y}$-valued tree $\mathbf{y}$ and define a set of random variables $X_i = 2R_i\big\|\sum_{t=1}^{n}\epsilon_t\mathbf{y}_t(\epsilon)\big\|_\star$. Let $B_i = 2D\sqrt{n}R_i$. Lemma 3 shows that


P{Xi - Bi> t) <2 exp 


\ SD^R'^n)' 


So we have $\sigma_i = 2DR_i\sqrt{n}$, and it will be sufficient to set $\bar\sigma = 2D\sqrt{n}$. Since our tail bound is purely sub-gaussian, we apply Proposition 2 with $\theta_i = \frac{\sigma_i}{B_i}\sqrt{2\log(\sigma_i/\bar\sigma)+4\log(i)}+1$, yielding the following bound:


sup Eg sup 

y 2eN 


2Rr 


t=l 


■ 'iD^/nR,^J\og{Ri) +log(f) 


< 12D\/n. 


□ 

Proof of Achievability for Example 4.5. Unfortunately, the general symmetrization proof in Lemma 1 does not suffice for this problem. In what follows we use a more specialized symmetrization technique to prove the lemma.

Lemma 11. For any countable class of experts, when we consider $\mathcal{F}$ to be the class of all distributions over the set of experts, the following adaptive bound is achievable:


Bn{f;yi-.n) = 


\ 


50(KL(/|7r) + log(n)) ^ {f,yt) + 50(KL(/|7r) + log(n)) + 1. 

4=1 


To show that the rate is achievable we need to show that $\mathcal{A}_n\le0$. Since each $\hat y_t$ is a distribution over experts and we are in the linear setting, we do not need to randomize in the definition of the minimax value. Let us use the shorthand

\[
C(f) = \mathrm{KL}(f\,|\,\pi)+\log(n),
\]

and take constants $K, K'$ to be determined later. Define


An 


inf sup 
SteA yt<=yll 


inf I + 

t=l /eA t=l 


N 


KCif) f + y^Cif) I 


Using repeated minimax swap, this expression is equal to 


sup inf 

Pt6A(V) StEA,, 


inf 1 + 

t=l /eA t=l 


\ 


'1- 

KC{f)j:Ki.f{e,,ytf + AlFCif)\ 


£ inf [{yt,yt)]- inf f ^{f,yt) + 


sup _ _ 

ptEA(>>) t=l§tEA /eA 

By sub-additivity of square-root, we pass to an upper bound 


\ 


KCif) + VlFCif) I 


(supEp 


yt~pt 

W Pt II t=i 


sup^ inf [{yt,yt)]-En,~f [{ei,yt)] ■ 

t=i StEA 


N 


Cif) (^Kf:Ei.f[{e.,ytf] + K'Cif)^ 
We now split the square root according to the formula $\sqrt{ab} = \inf_{\alpha>0}\{a/2\alpha+\alpha b/2\}$ and note the range of the optimal value:


1 




Cif) 


1 


(14) 


Let us discretize the interval in (14) into points $\alpha_i$ for $i = 1,\ldots,N$, and note that we only need to take $N = O(\log(n))$ elements. Write $I = \{\alpha_1,\ldots,\alpha_N\}$. Observe that

\[
\sqrt{ab} = \inf_{\alpha>0}\{a/2\alpha+\alpha b/2\}\ge\min_{\alpha\in I}\{a/4\alpha+\alpha b/2\}.
\]
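The inequality between the infimum over all $\alpha>0$ and the relaxed minimum over a finite set can be checked numerically. The sketch below assumes a doubling grid whose range contains the optimal $\alpha^* = \sqrt{a/b}$; the grid endpoints and the values of `a` and `b` are hypothetical. The largest grid point below $\alpha^*$ lies in $[\alpha^*/2,\alpha^*]$, so replacing $a/2\alpha$ by $a/4\alpha$ absorbs the discretization error.

```python
import math

def split_objective(a, b, alpha, left_factor=2.0):
    """a / (left_factor * alpha) + alpha * b / 2."""
    return a / (left_factor * alpha) + alpha * b / 2.0

def doubling_grid(alpha_min, alpha_max):
    """alpha_min, 2 alpha_min, 4 alpha_min, ... until alpha_max is covered."""
    grid = [alpha_min]
    while grid[-1] < alpha_max:
        grid.append(2.0 * grid[-1])
    return grid

def check(a, b, alpha_min, alpha_max):
    """Return (sqrt(ab) via the exact split, relaxed minimum over the grid)."""
    alpha_star = math.sqrt(a / b)
    assert alpha_min <= alpha_star <= alpha_max, "grid must cover the optimum"
    exact = split_objective(a, b, alpha_star)  # equals sqrt(a * b)
    relaxed = min(
        split_objective(a, b, alpha, left_factor=4.0)
        for alpha in doubling_grid(alpha_min, alpha_max)
    )
    return exact, relaxed

if __name__ == "__main__":
    print(check(3.7, 0.2, 0.01, 100.0))
```

Only $O(\log(\alpha_{\max}/\alpha_{\min}))$ grid points are needed, which is why a logarithmic number of elements suffices in the proof.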

For the rest of the proof, the maximum over $\alpha$ is taken within the set $I$. We have


Ar^ < |sUpE^j.pj 


sup X) inf ^ytl{yt,yt)]-^ei~f[{ei,yt)]-'^[Kf^Ei^f[{ei,ytf] + K'Cif)\-^^^^ 

t=i / 


fiA,a t=l ytiA(T) 


4a 


(15) 


Dropping some negative terms, we upper bound the last expression by 

li 


(sUpEyj.pj 

Pt //t=l 


sup J^{f,^[y't]-yt)-^'ZEi^f[{ei,ytf] 


faT.a t=l 


4a 


Adding and subtracting f [(ci, y!)^]]: 




SUpEyj.pj 

Pt 


Ka 


sup ^(/,E[i/;] -j/t) - [Ei./[(ei, 

(ijE,, [E,.f[{e.,y[f]]-E,.f[{e.,ytf]^ - ^ 


Using Jensen’s inequality to pull out expectations, we obtain an upper bound. 


llsupEy^y 'jj sup f^{f,yt-yt)-^ f^Ei^f[{ei,ytf] - ^ f^Ei^f[{ei,y'tf] 
W Pt //t=i 4 t=i 4 L j 

+ ^ [t^t~f[{euy'tf]-E,.f[{e,,y,f]j - ^ 


Next, we introduce Rademacher random variables: 


w 




^sup ^^et(^{f,yt-yt) + -Ei./[(ei,yt)^])j 


F/ \21 f/ '\^1 ^if) 

4 J 4 z^^^i~f[{^Ayt) \ 




w 


sup Eej 
yt llt^l 


sup 


Ka^ 


n / 

P Y.<^th{f,yt)+^Ei^f[{ei,yty 




Moving to the tree notation, we get 


supEe 


sup f]£t(2(/,yt(e)) + ^Ei./[(ei,yt(e))^]] - ^ Ei./[(ei,yt(e))^] 


21 KL(/|7r) log(n) 


4a 


4a 


Note that the convex conjugate of KL(/||7r) is given by Fk* (X) = - log (Ee-ir [exp (a(e, 77))]) and we express the last 
quantity as 


supEe 

y 


max — log I Ei^ 
a 4a 


exp( X! et(8«(ei,yt(£)) + 2Ka^(ei,yt{t)f) - 2Ka^ ((ei, yt(e)))' 


log(u) 

4a 
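Define $X_\alpha = \frac{1}{4\alpha}\log\Big(\mathbb{E}_{i\sim\pi}\Big[\exp\Big(\sum_{t=1}^{n}\Big[\epsilon_t\big(8\alpha\langle e_i,\mathbf{y}_t(\epsilon)\rangle+2K\alpha^2\langle e_i,\mathbf{y}_t(\epsilon)\rangle^2\big)-2K\alpha^2\langle e_i,\mathbf{y}_t(\epsilon)\rangle^2\Big]\Big)\Big]\Big)$. Our goal is to bound $\mathbb{E}\big[\max_\alpha\{X_\alpha-\log(n)/4\alpha\}\big]$. Now notice that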

P(X„>t)<infE[e^^“-^*] 


= inf j Eg 


< Ee 

< Ee 

< Ef 


Ei. 

Ei. 

E,;. 


E,., 

exp 

exp 

exp 


+ 2/fa^(ei,yt(e))^) -2N:a^(ei,yt(e))^j j exp(-At)j 
|^^et(8a(ei,yt(e)) + 2Ka'^{ei,yt(e)f^ - 2Ka'^{ei,yt{e)f'^ exp(-4at) 
|fj(8a(ei,yt(e)) + 2Ka^{ei,yt{e)f) -2Ka^{ei,yt{e)f^ 


exp(-4at) 


(EM4 

\t=i 


+ Ka) {ei,yt{e)) - 2Ka‘^{ei,yt{e)) 


1 ]] 


exp(-4at). 


The above term is upper bounded by $\exp(-4\alpha t)$ as soon as $4\alpha^2(4+K\alpha)^2\le2K\alpha^2$, which happens when

\[
0<\alpha\le\big(\sqrt{K/2}-4\big)/K. \tag{16}
\]

In view of (14), we know that $\alpha\le1/\sqrt{K'}$. Thus, to ensure (16), it is sufficient to take $K = 50$ and $K' = 50^2$. Other choices lead to a different balance of constants. We thus have

\[
P(X_\alpha>t)\le\exp(-4\alpha t).
\]
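The arithmetic behind the choice of constants is short enough to verify directly. The sketch below (function names are ours, for illustration) solves $4\alpha^2(4+K\alpha)^2\le2K\alpha^2$ for $\alpha$: dividing by $4\alpha^2$ gives $(4+K\alpha)^2\le K/2$, hence $\alpha\le(\sqrt{K/2}-4)/K$, which for $K = 50$ is $(5-4)/50 = 1/50$.

```python
import math

def alpha_threshold(K):
    """Largest alpha with 4 alpha^2 (4 + K alpha)^2 <= 2 K alpha^2."""
    # Dividing by 4 alpha^2 > 0: (4 + K alpha)^2 <= K / 2.
    return (math.sqrt(K / 2.0) - 4.0) / K

def condition_holds(K, alpha):
    """Evaluate the inequality 4 alpha^2 (4 + K alpha)^2 <= 2 K alpha^2 directly."""
    return 4.0 * alpha**2 * (4.0 + K * alpha) ** 2 <= 2.0 * K * alpha**2

if __name__ == "__main__":
    K = 50
    print(alpha_threshold(K))          # threshold for K = 50
    print(condition_holds(K, 0.019))   # just below the threshold
    print(condition_holds(K, 0.021))   # just above the threshold
```

Note that the threshold is positive only when $K>32$, so smaller values of $K$ cannot satisfy the condition for any $\alpha>0$.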

Now that we have the tail bound, we appeal to Proposition 2. Setting $s_i = 4\alpha_i$ and $B_i = 1/4\alpha_i$, we obtain that

log(n) 


E 


max i 


4a 


< 10 . 


□ 


B Relaxations and Algorithms 

Proof of Admissibility for Example 5.1.

Lemma 12. The following bound is achievable in the setting given in Example 5.1:

\[
B_n(f) = 3\sqrt{2n\max\{\mathrm{KL}(f\,|\,\pi),1\}}+4\sqrt{n}. \tag{17}
\]

Following the analysis style of Corollary 8, we do not directly consider an upper bound based on $\mathrm{KL}(f\,|\,\pi)$ but instead use a complexity-radius-based upper bound with the KL divergence controlling the complexity radius: $\mathcal{F}(R) = \{f:\mathrm{KL}(f\,|\,\pi)\le R\}$. Concretely, we move from (17) to the bound

\[
B_n(i) = 3\sqrt{nR_i}+4\sqrt{n}
\]

for $R_i = 2^{i-1}$ with $i\in\mathbb{N}$. To keep the analysis as tidy as possible, we will study the achievability of $B_n(i) = D\sqrt{R_in}$, setting $D$ and including additive constants only when we reach a point in the analysis where it becomes necessary to do so. The relaxation we consider is thus


\[
\mathrm{Rel}_n(y_{1:t}) = \inf_{\lambda>0}\Big\{\frac{1}{\lambda}\log\Big(\sum_i\exp\Big(-\lambda\Big[\sum_{s=1}^{t}\langle q_s^{R_i}(y_{1:s-1}),y_s\rangle-2\sqrt{nR_i}+B_n(i)\Big]\Big)\Big)+2\lambda(n-t)\Big\}.
\]

Initial Condition: This inequality follows from Lemma 13 and an application of the softmax function as an upper bound on the supremum over $i$.


-inf inf Y, e(f,yt) + Bnii) 

t 

= sup - Yks'{yi-s-i),ys) + 2^/nR,-Bn{i) 


1 


< inf - log I 

A>0 


— {jjv.n) • 




Admissibility Condition: Define a strategy ql via 

exp(-At {qf' {yv.s-i),ys)- 2^/rrR~ + Bn{i)]) 


{yt)^ = 


Tj exp(^-Xt[T,lJi{q^' iyi-.s-i),ys) - 2^/^j + BniRj)]) 


where we have set 


At = arg min 
A>0 


A 


log ^ exp -A 


Yu{yf"{yi-s-i),ys)-^'JnRi + Bnii) jj + 2A(n-t + l) 


We proceed to demonstrate admissibility: 


inf sup[(gt,yt) + Rel„(?/i;t)] 

gt yt 


llogi 


= inf sup 
yt 


{qt,yt) + inf 

A>0 


Ej9f'(2/ns-i),2/s) - 2y/n^ + S„(z) jj + 2A(n-t) 


We now plug in ql and Aj as described above: 


< sup 


yt L''^t 


^log(exp(AtE,.,*((7f‘(yi:t-i),?/t>)) + ^ log(E,.,. exp(-At ((?f‘(yi:t-i),?/t>)) 


Xj(9f‘(?/ns-i).2/4“2\/n^ + S„(i) jj + 2At*(n-t) 


+ TT log 


We combine the first two terms in the expression and apply Jensen’s inequality to arrive at an upper bound: 


< sup 

yt 


■ exp^At (gf’(yi:t-i) - gf’'(yi:t-i),yt|)) 


+ ^log|Eoxpl--l^J 


E(9f'(yi:s-i)>2/s} - 2^/n^ + Bn{i) j j + 2At (n - t) 


The first term is now bounded using sub-gaussianity. 


< T-log 




= inf 
A>0 


-log(^exp(-A 


Ls=i 
t-i 


Y{ys 'iyi-s-i),ys) - 2\fnR^ + Bn{i) 

1 


)) 


+ 2A^ {ti “ t + 1) 
j j + 2A(n - t + 1) 


= Rel„(?/i;t-i). 
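Having shown that $\mathrm{Rel}_n$ is an admissible relaxation, it remains to show that the relaxation's final value,

\[
\mathrm{Rel}_n(\cdot) = \inf_{\lambda>0}\Big\{\frac{1}{\lambda}\log\Big(\sum_i\exp\big(-\lambda\big[B_n(i)-2\sqrt{nR_i}\big]\big)\Big)+2\lambda n\Big\},
\]

is not too large. Setting $D = 3$,

\[
\mathrm{Rel}_n(\cdot) = \inf_{\lambda>0}\Big\{\frac{1}{\lambda}\log\Big(\sum_i\exp\big(-\lambda\sqrt{nR_i}\big)\Big)+2\lambda n\Big\}.
\]

The complexity radius $R_i$ is discretized such that $R_i-R_{i-1}\ge1$, yielding

\[
\begin{aligned}
\mathrm{Rel}_n(\cdot)
&\le\inf_{\lambda>0}\Big\{\frac{1}{\lambda}\log\Big(\exp(-\lambda\sqrt{n})+\sum_{i\ge2}(R_i-R_{i-1})\exp\big(-\lambda\sqrt{nR_i}\big)\Big)+2\lambda n\Big\}\\
&\le\inf_{\lambda>0}\Big\{\frac{1}{\lambda}\log\Big(\exp(-\lambda\sqrt{n})+\int_1^\infty\exp\big(-\lambda\sqrt{nR}\big)\,dR\Big)+2\lambda n\Big\}.
\end{aligned}
\]

The integral is a routine calculation:

\[
\int\exp\big(-\lambda\sqrt{nR}\big)\,dR = -\frac{2}{\lambda^2n}\exp\big(-\lambda\sqrt{nR}\big)\big[\lambda\sqrt{nR}+1\big].
\]

Finally, set $\lambda = 1/\sqrt{n}$, yielding

\[
\mathrm{Rel}_n(\cdot)\le4\sqrt{n}.
\]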
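The "routine calculation" can be verified numerically. The sketch below assumes the integral in question is $\int_1^\infty\exp(-\lambda\sqrt{nR})\,dR$, which is our reading of the garbled display; under that reading the antiderivative is $-\frac{2}{\lambda^2n}e^{-\lambda\sqrt{nR}}(\lambda\sqrt{nR}+1)$. It compares this closed form against a midpoint-rule approximation and then evaluates the resulting bound at $\lambda = 1/\sqrt{n}$.

```python
import math

def antiderivative(R, lam, n):
    """Antiderivative of exp(-lam * sqrt(n * R)) with respect to R."""
    c = lam * math.sqrt(n)
    u = math.sqrt(R)
    return -2.0 / c**2 * math.exp(-c * u) * (c * u + 1.0)

def tail_integral(lam, n, lower=1.0):
    """Closed form of the integral from `lower` to infinity (the limit at infinity is 0)."""
    return -antiderivative(lower, lam, n)

def midpoint_integral(lam, n, lower=1.0, upper=3000.0, steps=300000):
    """Midpoint-rule approximation on [lower, upper]; the tail beyond `upper` is negligible here."""
    h = (upper - lower) / steps
    c = lam * math.sqrt(n)
    return h * sum(
        math.exp(-c * math.sqrt(lower + (k + 0.5) * h)) for k in range(steps)
    )

def final_value(n):
    """(1/lam) log(exp(-lam sqrt(n)) + tail integral) + 2 lam n at lam = 1/sqrt(n)."""
    lam = 1.0 / math.sqrt(n)
    inside = math.exp(-lam * math.sqrt(n)) + tail_integral(lam, n)
    return math.log(inside) / lam + 2.0 * lam * n

if __name__ == "__main__":
    n = 100
    lam = 1.0 / math.sqrt(n)
    print(tail_integral(lam, n), midpoint_integral(lam, n))
    print(final_value(n), 4.0 * math.sqrt(n))
```

At $\lambda = 1/\sqrt{n}$ the tail integral equals $4/e$, so the expression evaluates to $(\log(5/e)+2)\sqrt{n}$, comfortably below $4\sqrt{n}$.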


Note that instead of setting $\lambda_t = \lambda_t^*$ as described above, we could have set $\lambda_t = 1/\sqrt{n}$ and achieved the same regret bound. □

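Lemma 13. Consider the experts setting from Example 4.5, but with hypothesis class $\mathcal{F}(R) = \{f:\mathrm{KL}(f\,|\,\pi)\le R\}$. The following inequality holds:

\[
-\inf_{f\in\mathcal{F}(R)}\sum_{t=1}^{n}\langle y_t,f\rangle\le-\sum_{t=1}^{n}\langle y_t,q_t^R(y_{1:t-1})\rangle+2\sqrt{Rn}.
\]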

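Proof. Our strategy is to move to an upper bound based on the Kullback-Leibler divergence and exploit convex duality:

\[
\begin{aligned}
-\inf_{f\in\mathcal{F}(R)}\sum_{t=1}^{n}\langle y_t,f\rangle
&\le-\inf_{f\in\mathcal{F}(R)}\Big\{\sum_{t=1}^{n}\langle y_t,f\rangle+\alpha\,\mathrm{KL}(f\,|\,\pi)\Big\}+\alpha R\\
&\le-\inf_{f\in\mathcal{F}}\Big\{\sum_{t=1}^{n}\langle y_t,f\rangle+\alpha\,\mathrm{KL}(f\,|\,\pi)\Big\}+\alpha R.
\end{aligned}
\]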

We use $\Psi^*$ to denote the Fenchel conjugate of $\mathrm{KL}(\cdot\,|\,\pi)$.

The function $\mathrm{KL}(\cdot\,|\,\pi)$ is 1-strongly convex, which implies that $\Psi^*$ is 1-strongly smooth. We peel off one term at a time:




1 

+ — . 
a 
This obtains the following upper bound: 




t=l 


S = 1 


Setting a = ^fnfR and noting that J/s j = Q^ivi-t-i) yields the result. □ 
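The convex duality used in this proof rests on a standard fact: for a finite expert set, $\inf_f\{\langle z,f\rangle+\alpha\,\mathrm{KL}(f\,|\,\pi)\} = -\alpha\log\sum_i\pi_i e^{-z_i/\alpha}$, attained at the Gibbs distribution $f_i\propto\pi_ie^{-z_i/\alpha}$. The sketch below checks this numerically; the prior `pi`, cumulative loss vector `z`, and `alpha` are hypothetical values, and restricting to finitely many experts is a simplifying assumption.

```python
import math
import random

def kl(f, pi):
    """KL(f | pi) for distributions on a finite set (terms with f_i = 0 contribute 0)."""
    return sum(fi * math.log(fi / pi_i) for fi, pi_i in zip(f, pi) if fi > 0.0)

def gibbs(pi, z, alpha):
    """Exponential-weights distribution f_i proportional to pi_i * exp(-z_i / alpha)."""
    weights = [pi_i * math.exp(-zi / alpha) for pi_i, zi in zip(pi, z)]
    total = sum(weights)
    return [w / total for w in weights]

def objective(f, pi, z, alpha):
    """<z, f> + alpha * KL(f | pi)."""
    return sum(zi * fi for zi, fi in zip(z, f)) + alpha * kl(f, pi)

def closed_form_min(pi, z, alpha):
    """-alpha * log sum_i pi_i exp(-z_i / alpha), the claimed value of the infimum."""
    return -alpha * math.log(
        sum(pi_i * math.exp(-zi / alpha) for pi_i, zi in zip(pi, z))
    )

if __name__ == "__main__":
    pi = [0.5, 0.3, 0.2]   # hypothetical prior over three experts
    z = [1.0, -0.5, 2.0]   # hypothetical cumulative losses sum_t y_t
    alpha = 0.7
    f_star = gibbs(pi, z, alpha)
    print(objective(f_star, pi, z, alpha), closed_form_min(pi, z, alpha))
```

This is also why the strategy in the lemma takes an exponential-weights form: the minimizer of the regularized objective is the prior reweighted by exponentiated cumulative losses.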

Proof of Lemma 9. Recall the form of the Ada„ relaxation, where we have abbreviated Rel^ to R^: 

n 

R«(yi,t) - R«0(R«) + 2 ^ 


Ada„( 2 /i,t) = supEe sup 
y.y' R . 


S = t+1 


Initial Condition: This directly follows from the fact that R'^ satisfy the initial condition: 


Ada„(?/i;„) = sup[R^(j/i,„) - R^6»(R^)] 


R 


> sup 


- inf 


R L fiR(R)t=l 


= - inf inf 
R f^R{R) 


Y,iif,yt)+R^eCR^) 


Therefore, playing the strategy corresponding to Ada„ yields an adaptive regret bound of the form Bn{R) = 

Rel«(-)0(Rel^(-))+Ada„(.). 

Admissibility Condition: We obtain the following equalities using the same minimax swap technique as 
in the Lemma 1 proof: 

inf supEyj..,,^ [i{yt,yt) + Ada„(yi;t)] 


9t yt 

= inf sup ^yt~qt sup Ee sup 
Qt yt y.y' fl _ 


n 


e{yt,yt) + R^{yi-.t) - R^0{R^) + ‘2- Y, ^sKys~qf{yi:t,y[^^,,^fe)/{ys,Ya{e)) 


= supEyj..,pj sup Eg sup 
Pt y,y' L fit 

Note that 


s=i+l 

> i?/)/T> R\ 


miEyt^^pJ{yt,yt) + R^{yi,t)-R^0{R^) + 2 Y (!/i,t.yU,_i(0)^(y«>ys(^)) 


S = t+1 


iniEyt^pJ{yt,yt) = inf Ey^^q^Eyt^pJ{yt,y^), 
yt qtsA{T>) 

and we may replace the infimizing distribution with the randomized strategy corresponding to Rel„. 
The fact that this strategy depends on yi-,t-i is left implicit. This yields an upper bound, 

n 

supEp^.p^supE,sup Eyt^^p^E^^^yRe{yt,yt)+R^{yi,t)-R^0{R^) + 2 Y (yi^t.yLi,, ,{e))^iys,yaie)) 

Pt y,y' R L s=t+l 

which we can write by adding and subtracting Ey^^qR£(yt,yt) as 

supEp^.p, supE, sup\Eyt^^p^E^^^gR£{yt,yt) - E^^^yR£(yt,yt) + E^^^yRe(yt,yt) + R^{yi-.t) - R’^diR^) 

y.y' R 


+ 2 E 

S = t+1 
Now, using the fact that the relaxations $\mathrm{Rel}_n^R$ are admissible,

< supEy^.p, supE, sup\Ey>^^p^E^^^gR£{yt,yt)-E^^^gR£{yt,yt) + R^(i/i:t-i) -R^6i(R^) 

Pt y,y' R 

n 

+ 2 ^ ^s®ys~gf(!/iit,yi+i.,_i(e))^(ys>ys(^)) 

S = t+1 

By Jensen’s inequality, we upper bound the last expression by 

supEp,,pj.p, supE, sup \E^^^gR£{yt,yt) - E^^^gR£(yt,yt) + R^(2/i;t-i) - R^6»(R^) 

Pt y.y' R 

n 

+2 Y, e^^ys~qHyi:t,yU,:,_,{e)Ays,ysie)) 

S = t+1 

We now replace each choice $y_t$ in the last sum by a worst-case choice $y''_t$:

< supEp^^pj.p^ sup sup Ee sup \E^^^yR£{yt, y't) - E^^^yR£{yt,yt) + R^(yi;t-i) - R^6<(R^) 
y” y,y' R ^ 


Pt 


^2 E 


s=i+l 


We then introduce $\epsilon_t$ since $y_t, y'_t$ can be renamed. The last expression is equal to
supEyt,i/;~ptEet sup sup Eg sup \E^^^gR[€ti£iyt,y't) - £iyt,yt))] + 

Pt y'l y.y' R 

n 

s=t+l 

By splitting into two terms we arrive at an upper bound of 

supEpj..,p^E<:t sup sup Ee sup\2etE^^^gR[£{yt,yt)] + R^(i/i;t-i) - R^0(R^) 

Pt y'l y.y' R ^ 

n 

+2 E ^s^ys-qf-iyi-.t-i.yl,y't^i..,_i(<t))^iys^yA)) 

S = t+1 

= supEet sup sup Ee sup\2€tE^^^gR[£{yt,yt)] + R^(j/i;t-i) - R^6»(R^) 
yt yl y.y' R ^ 


+2 E 


S = t+1 


= Ada„(yi;t_i). 


□ 